Postgres is not able to handle more than 4k tables!?
I want to describe a bad situation we have encountered with one of our
customers.
There are ~5000 tables in their database, and what is worse, most of
them are actively used.
Several flaws in Postgres then brought their system almost to a standstill.
Autovacuum periodically processes all these 5k relations (because
they are actively updated).
And since most of these tables are small enough, autovacuum completes
processing them at almost the same time.
As a result, the autovacuum workers produce ~5k invalidation messages
within a short period of time.
There are several thousand clients, most of which are executing complex
queries.
So they are not able to process all these invalidation messages in time,
and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be
changed without recompiling Postgres.
This is problem number 1.
As a result, resetState is set to true, forcing backends to invalidate
their caches.
So most backends lose their cached metadata and have to access the
system catalog to reload it.
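For illustration, here is a minimal sketch (invented names, not the actual sinvaladt.c code) of how a fixed-size message ring forces a backend that lags by more than MAXNUMMESSAGES into a full cache reset:

```c
#include <assert.h>
#include <stdbool.h>

#define MAXNUMMESSAGES 4096     /* hardcoded ring size, as in sinvaladt.c */

/* Hypothetical per-backend reader state. */
typedef struct
{
    int  nextMsgNum;    /* next message this backend would consume */
    bool resetState;    /* set when the backend fell too far behind */
} ReaderState;

/*
 * When the writers have advanced to maxMsgNum, a reader that lags by more
 * than MAXNUMMESSAGES has had unread messages overwritten: the only safe
 * recovery is to discard all cached catalog state and re-read the catalog.
 */
static void
check_overflow(ReaderState *r, int maxMsgNum)
{
    if (maxMsgNum - r->nextMsgNum > MAXNUMMESSAGES)
    {
        r->resetState = true;       /* backend must rebuild its caches */
        r->nextMsgNum = maxMsgNum;  /* skip the lost messages */
    }
}
```

With ~5k invalidation messages arriving in a burst, a backend busy with a long query exceeds this lag almost immediately.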
But then we come to the next showstopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
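For context, the partition protecting a given lock's hash chain is chosen by reducing the lock tag's hash code modulo this constant. A minimal sketch (simplified rendering of the LockHashPartition() macro in lock.h, not the exact Postgres code):

```c
#include <assert.h>
#include <stdint.h>

#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)

/* Map a lock tag's hash code to one of the 16 partition LWLocks. */
static uint32_t
lock_hash_partition(uint32_t hashcode)
{
    return hashcode % NUM_LOCK_PARTITIONS;
}
```

With ~25k relations hashed into just 16 partitions, each partition lock covers on the order of 1500 relations, so two backends touching unrelated tables still collide on the same LWLock with high probability.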
Having just 16 LWLocks greatly increases the probability of conflicts (taking
into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage
of some queries grows from milliseconds to several minutes!
Koda!
This is problem number 2. But there is one more flaw we have run into.
We increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.
But there are several places in Postgres where it tries to hold all
partition locks (for example, in the deadlock detector).
Clearly, if NUM_LOCK_PARTITIONS > MAX_SIMUL_LWLOCKS, we get this error.
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants should
be replaced with GUCs.
To avoid division, we can specify the log2 of these values, so a shift can be
used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS +
NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
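A sketch of the log2 idea, assuming a hypothetical GUC named log2_num_lock_partitions: keeping the partition count a power of two lets the modulo reduce to a bitmask, so no integer division is needed at runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical GUC: log2 of the number of lock partitions. */
static int log2_num_lock_partitions = 4;    /* default: 16 partitions */

static uint32_t
lock_hash_partition(uint32_t hashcode)
{
    /* A mask is cheap; no division needed because the count is 2^k. */
    uint32_t mask = (1u << log2_num_lock_partitions) - 1;
    return hashcode & mask;
}
```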
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
There are several thousand clients, most of which are executing complex
queries.
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
So they are not able to process all these invalidation messages in time,
and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be
changed without recompiling Postgres.
This is problem number 1.
No, this isn't a problem. Or at least you haven't shown a reason to
think it is. Sinval overruns are somewhat routine, and we certainly
test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
But then we come to the next showstopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking
into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage
of some queries grows from milliseconds to several minutes!
Really?
This is problem number 2. But there is one more flaw we have run into.
We increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.
Seems like self-inflicted damage. I certainly don't recall anyplace
in the docs where we suggest that you can alter that constant without
worrying about consequences.
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants should
be replaced with GUCs.
I seriously doubt we'd do that.
regards, tom lane
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Autovacuum periodically processes all these 5k relations (because
they are actively updated).
And since most of these tables are small enough, autovacuum completes
processing them at almost the same time.
As a result, the autovacuum workers produce ~5k invalidation messages
within a short period of time.
How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)? It's available since PG 12. It causes autovacuum to not truncate the relation. It's the relation truncation that produces those shared invalidation messages.
But then we come to the next showstopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking
into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage
of some queries grows from milliseconds to several minutes!
Koda!
The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter. Truncation acquires an Access Exclusive lock on the relation. Without the strong Access Exclusive lock, plain DML statements use fast-path locking, which doesn't acquire the lock manager partition lock.
The long LWLock wait is a sad story. The victims are probably the exclusive lockers. When someone holds a shared lock on an LWLock, an exclusive locker has to wait. That's OK. However, if another shared locker comes later, it acquires the LWLock even though there are waiting exclusive lockers. That's unfair, but this is the community's decision.
Regards
Takayuki Tsunakawa
On 09.07.2020 03:49, tsunakawa.takay@fujitsu.com wrote:
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Autovacuum periodically processes all these 5k relations (because
they are actively updated).
And since most of these tables are small enough, autovacuum completes
processing them at almost the same time.
As a result, the autovacuum workers produce ~5k invalidation messages
within a short period of time.
How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)? It's available since PG 12. It causes autovacuum to not truncate the relation. It's the relation truncation that produces those shared invalidation messages.
Invalidation messages are also caused by statistics updates:
#0 0x000055a85f4f5fd6 in RegisterCatcacheInvalidation (cacheId=49,
hashValue=715727843, dbId=12443)
at inval.c:483
#1 0x000055a85f4f4dc2 in PrepareToInvalidateCacheTuple
(relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70,
newtuple=0x7f4598e75ef8, function=0x55a85f4f5fc0
<RegisterCatcacheInvalidation>) at catcache.c:1830
#2 0x000055a85f4f6b21 in CacheInvalidateHeapTuple
(relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70,
newtuple=0x7f4598e75ef8) at inval.c:1149
#3 0x000055a85f016372 in heap_update (relation=0x7f45a34ce5a0,
otid=0x7f4598e75efc,
newtup=0x7f4598e75ef8, cid=0, crosscheck=0x0, wait=1 '\001',
hufd=0x7ffc75bebcf0,
lockmode=0x7ffc75bebce8) at heapam.c:4245
#4 0x000055a85f016f98 in simple_heap_update (relation=0x7f45a34ce5a0,
otid=0x7f4598e75efc,
tup=0x7f4598e75ef8) at heapam.c:4490
#5 0x000055a85f153ec5 in update_attstats (relid=16384, inh=0 '\000',
natts=1, vacattrstats=0x55a860f0fba0)
at analyze.c:1619
#6 0x000055a85f151898 in do_analyze_rel (onerel=0x7f45a3480080,
options=98, params=0x55a860f0f028,
va_cols=0x0, acquirefunc=0x55a85f15264e <acquire_sample_rows>,
relpages=26549, inh=0 '\000',
in_outer_xact=0 '\000', elevel=13) at analyze.c:562
#7 0x000055a85f150be1 in analyze_rel (relid=16384,
relation=0x7ffc75bec370, options=98,
params=0x55a860f0f028, va_cols=0x0, in_outer_xact=0 '\000',
bstrategy=0x55a860f0f0b8) at analyze.c:257
#8 0x000055a85f1e1589 in vacuum (options=98, relation=0x7ffc75bec370,
relid=16384, params=0x55a860f0f028,
va_cols=0x0, bstrategy=0x55a860f0f0b8, isTopLevel=1 '\001') at
vacuum.c:320
#9 0x000055a85f2fd92a in autovacuum_do_vac_analyze (tab=0x55a860f0f020,
bstrategy=0x55a860f0f0b8)
at autovacuum.c:2874
#10 0x000055a85f2fcccb in do_autovacuum () at autovacuum.c:2374
But then we come to the next showstopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking
into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage
of some queries grows from milliseconds to several minutes!
Koda!
The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter. Truncation acquires an Access Exclusive lock on the relation. Without the strong Access Exclusive lock, plain DML statements use fast-path locking, which doesn't acquire the lock manager partition lock.
Looks like that is not true (at least for PG 9.6):
#0 0x00007fa6d30da087 in semop () from /lib64/libc.so.6
#1 0x0000000000682241 in PGSemaphoreLock
(sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
#2 0x00000000006ec6eb in LWLockAcquire (lock=lock@entry=0x7f23b544f800,
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
#3 0x00000000006e5560 in LockAcquireExtended
(locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1,
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0
'\000', reportMemoryError=reportMemoryError@entry=1 '\001',
locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
#4 0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1)
at lmgr.c:113
#5 0x00000000004a9f55 in relation_open (relationId=87103837,
lockmode=lockmode@entry=1) at heapam.c:1131
#6 0x00000000004bdc66 in index_open (relationId=<optimized out>,
lockmode=lockmode@entry=1) at indexam.c:151
#7 0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758,
relationObjectId=72079078, inhparent=<optimized out>,
rel=rel@entry=0x3a2d460) at plancat.c:183
#8 0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758,
relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148
Please notice lockmode=1 (AccessShareLock)
The long LWLock wait is a sad story. The victims are probably the exclusive lockers. When someone holds a shared lock on an LWLock, an exclusive locker has to wait. That's OK. However, if another shared locker comes later, it acquires the LWLock even though there are waiting exclusive lockers. That's unfair, but this is the community's decision.
Yes, I also think that this is the reason for the problem.
Alexander Korotkov has implemented fair LWLocks, which eliminate this
kind of problem in some scenarios.
Maybe they can also help here.
On 09.07.2020 00:35, Tom Lane wrote:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
There are several thousand clients, most of which are executing complex
queries.
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
It is not my problem; it is the customer's problem.
Certainly, advising the use of a connection pooler was the first thing we
proposed to the customer when we saw such a large number of active backends.
Unfortunately, it is not always possible (a connection pooler does not
preserve session semantics).
This is why I have proposed a built-in connection pooler for Postgres.
But that is a different story.
So they are not able to process all these invalidation messages in time,
and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be
changed without recompiling Postgres.
This is problem number 1.
No, this isn't a problem. Or at least you haven't shown a reason to
think it is. Sinval overruns are somewhat routine, and we certainly
test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
Certainly, a cache overrun is not fatal.
But if most backends are blocked in heap_open of the pg_attribute
relation, then something is not OK with Postgres, isn't it?
It causes huge LWLock acquisition times in heap_open, and the planning stage
of some queries grows from milliseconds to several minutes!
Really?
Planning time: 75698.602 ms
Execution time: 0.861 ms
This is problem number 2. But there is one more flaw we have run into.
We increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.
Seems like self-inflicted damage. I certainly don't recall anyplace
in the docs where we suggest that you can alter that constant without
worrying about consequences.
It looks like you are trying to convince me that the practice of hardcoding
constants in the code, without taking the relations between them into
account, is a good design pattern?
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants should
be replaced with GUCs.
I seriously doubt we'd do that.
It's a pity, because such an attitude is one of the reasons why Postgres
is a pgbench-oriented database showing good results on notebooks
but not on real systems running on powerful servers (NUMA, SSD, huge amounts
of memory, large numbers of cores, ...).
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Looks like that is not true (at least for PG 9.6):
#0 0x00007fa6d30da087 in semop () from /lib64/libc.so.6
#1 0x0000000000682241 in PGSemaphoreLock
(sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
#2 0x00000000006ec6eb in LWLockAcquire
(lock=lock@entry=0x7f23b544f800,
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
#3 0x00000000006e5560 in LockAcquireExtended
(locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1,
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0
'\000', reportMemoryError=reportMemoryError@entry=1 '\001',
locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
#4 0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1)
at lmgr.c:113
#5 0x00000000004a9f55 in relation_open (relationId=87103837,
lockmode=lockmode@entry=1) at heapam.c:1131
#6 0x00000000004bdc66 in index_open (relationId=<optimized out>,
lockmode=lockmode@entry=1) at indexam.c:151
#7 0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758,
relationObjectId=72079078, inhparent=<optimized out>,
rel=rel@entry=0x3a2d460) at plancat.c:183
#8 0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758,
relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148
Please notice lockmode=1 (AccessShareLock).
Ouch, there exists another sad hardcoded value: the maximum number of locks that can be acquired via the fast-path mechanism.
[LockAcquireExtended]
/*
* Attempt to take lock via fast path, if eligible. But if we remember
* having filled up the fast path array, we don't attempt to make any
* further use of it until we release some locks. It's possible that some
* other backend has transferred some of those locks to the shared hash
* table, leaving space free, but it's not worth acquiring the LWLock just
* to check. It's also possible that we're acquiring a second or third
* lock type on a relation we have already locked using the fast-path, but
* for now we don't worry about that case either.
*/
if (EligibleForRelationFastPath(locktag, lockmode) &&
FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
{
/*
* We allow a small number of "weak" relation locks (AccessShareLock,
* RowShareLock, RowExclusiveLock) to be recorded in the PGPROC structure
* rather than the main lock table. This eases contention on the lock
* manager LWLocks. See storage/lmgr/README for additional details.
*/
#define FP_LOCK_SLOTS_PER_BACKEND 16
16 looks easily exceeded even in a not-so-long OLTP transaction, especially if the table is partitioned. I wonder if we're caught in the hell of lock manager partition lock contention without knowing it. I'm afraid other pitfalls are lurking when there are many relations.
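To illustrate the arithmetic (an invented helper, not Postgres code): a statement takes one weak relation lock per table partition it touches plus one per index, so the 16 fast-path slots are exhausted quickly, and every remaining lock must go through the partitioned lock manager:

```c
#include <assert.h>

#define FP_LOCK_SLOTS_PER_BACKEND 16

/*
 * Simplified model: how many relation locks spill past the fast path for
 * one statement touching `partitions` table partitions, each having
 * `indexes_per_partition` indexes (one lock per relation).
 */
static int
locks_past_fast_path(int partitions, int indexes_per_partition)
{
    int total = partitions * (1 + indexes_per_partition);

    return total > FP_LOCK_SLOTS_PER_BACKEND
        ? total - FP_LOCK_SLOTS_PER_BACKEND
        : 0;
}
```

A scan of a 32-partition table with two indexes per partition would, in this model, push 80 lock acquisitions through the lock manager partition LWLocks.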
Regards
Takayuki Tsunakawa
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
There are several thousand clients, most of which are executing complex
queries.
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems. I'm pretty sure we all agree that using a
connection pooler is recommended, but if there's things we can do to
make the system work at least a bit better when folks do use lots of
connections, provided we don't materially damage other cases, that's
probably worthwhile.
So them are not able to process all this invalidation messages and their
invalidation message buffer is overflown.
Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be
changed without recompilation of Postgres.
This is problem number 1.
No, this isn't a problem. Or at least you haven't shown a reason to
think it is. Sinval overruns are somewhat routine, and we certainly
test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
Testing that it doesn't outright break and having it be decently
performant are two rather different things. I think we're talking more
about performance and not so much about if the system is outright broken
in this case.
But then we come to the next showstopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking
into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage
of some queries grows from milliseconds to several minutes!
Really?
Apparently, given the response down-thread.
This is problem number 2. But there is one more flaw we have faced with.
We have increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients start to report "too many LWLocks taken" error.
There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.
Seems like self-inflicted damage. I certainly don't recall anyplace
in the docs where we suggest that you can alter that constant without
worrying about consequences.
Perhaps not in the docs, but would be good to make note of it somewhere,
as I don't think it's really appropriate to assume these constants won't
ever change and whomever contemplates changing them would appreciate
knowing about other related values.
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants should
be replaced with GUCs.
I seriously doubt we'd do that.
Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems.
As I understood the report, it was not "things completely fall over",
it was "performance gets bad". But let's get real. Unless the OP
has a machine with thousands of CPUs, trying to run this way is
counterproductive.
Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them. Right
now I feel no hesitation about saying "if it hurts, don't do that".
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems.
As I understood the report, it was not "things completely fall over",
it was "performance gets bad". But let's get real. Unless the OP
has a machine with thousands of CPUs, trying to run this way is
counterproductive.
Right, the issue is that performance gets bad (or, really, more like
terrible...), and regardless of whether it's ideal or not, lots of folks
actually do run PG with thousands of connections, and we know that at
start-up time because they've set max_connections to a sufficiently high
value to support doing exactly that.
Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them. Right
now I feel no hesitation about saying "if it hurts, don't do that".
I disagree that we should completely ignore these use-cases.
Thanks,
Stephen
Hi Stephen,
Thank you for supporting the opinion that these are problems not only of the
client's system design (I agree it is not a good idea to have thousands of
tables and thousands of active backends) but also of Postgres.
We have investigated further and found one more problem in
Postgres causing the "invalidation storm".
There are some long-living transactions which prevent autovacuum from doing
its work and removing dead tuples.
So autovacuum is started again and again, and each time it makes no
progress but updates statistics and thus sends invalidation messages.
autovacuum_naptime was set to 30 seconds, so every 30 seconds autovacuum
processed a huge number of tables and initiated a large number of
invalidation messages, which quite soon caused the invalidation message
buffers of backends running long OLAP queries to overflow.
This makes me think about two possible optimizations:
1. Provide separate invalidation messages for relation metadata and its
statistics, so that a statistics update does not invalidate the relation cache.
The main problem with this proposal is that pg_class contains the relpages
and reltuples columns, which conceptually are part of the relation statistics
but are stored in the relation cache. If the relation statistics are updated,
then most likely these fields are also changed, so we have to remove the
relation from the relation cache in any case.
2. Remember in the relation info the XID of the oldest active transaction at
the moment of the last autovacuum.
At the next autovacuum iteration, first compare this stored XID with the XID
of the current oldest active transaction, and bypass vacuuming this relation
if the XID has not changed.
Thoughts?
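A rough sketch of optimization 2 (field and function names are invented, not existing Postgres identifiers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical per-relation bookkeeping kept between autovacuum runs. */
typedef struct
{
    TransactionId last_autovac_oldest_xmin; /* oldest running XID seen at
                                             * the previous vacuum pass */
} RelVacInfo;

/*
 * If the oldest active transaction has not advanced since the previous
 * autovacuum pass, no additional tuples can have become removable, so
 * vacuuming the relation again would do no useful work (and would only
 * generate more invalidation traffic).
 */
static bool
should_skip_autovacuum(RelVacInfo *rel, TransactionId current_oldest_xmin)
{
    if (rel->last_autovac_oldest_xmin == current_oldest_xmin)
        return true;
    rel->last_autovac_oldest_xmin = current_oldest_xmin;
    return false;
}
```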
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems. I'm pretty sure we all agree that using a
connection pooler is recommended, but if there's things we can do to
make the system work at least a bit better when folks do use lots of
connections, provided we don't materially damage other cases, that's
probably worthwhile.
I also think that Postgres performance should degrade gradually with an
increasing number of active backends.
Actually, further investigation of this particular case shows that the large
number of database connections was caused by ... the Postgres slowdown itself.
During normal operation the number of active backends is a few hundred.
But the "invalidation storm" caused queries to hang, so the user application
had to initiate more and more new connections to perform the required actions.
Yes, this may not be the best behavior for the application in this case. At
least it should first terminate the current connection using
pg_terminate_backend. I just want to note that the large number of
backends was not the core of the problem.
Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?
I think that the optimal number of lock partitions should depend
not on the number of connections
but on the number of available CPU cores, and thus on the expected level of
concurrency.
It is hard to propose a portable way to obtain this number.
This is why I think a GUC is the better solution.
Certainly, I realize that it is a very dangerous parameter which should be
changed with special care.
Not only because of MAX_SIMUL_LWLOCKS.
There are a few places in Postgres where it tries to lock all partitions
(the deadlock detector, logical replication, ...).
If there are thousands of partitions, then such locking becomes too
expensive, and we get yet another
popular Postgres problem: the "deadlock detection storm", when due to high
contention between backends a lock cannot be obtained
within deadlock_timeout and so initiates deadlock detection. Simultaneous
deadlock detection performed by all backends
(each of which tries to take ALL partition locks) paralyzes the system (TPS
falls to 0).
A proposed patch for this problem was also rejected (once again, the problem
can be reproduced only on a powerful server with a large number of cores).
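For illustration, a core-count-based default for the number of lock partitions could look like this (a hypothetical function; how to portably obtain the core count is exactly the hard part mentioned above):

```c
#include <assert.h>

/*
 * Round the core count up to the next power of two, clamped to a sane
 * range: keep today's minimum of 16, and cap the count so that
 * lock-all-partitions operations (deadlock detector, etc.) stay cheap.
 */
static int
default_lock_partitions(int ncores)
{
    int n = 16;

    while (n < ncores && n < 1024)
        n <<= 1;
    return n;
}
```

The power-of-two constraint keeps the partition lookup a simple bitmask, and the upper clamp bounds the cost of the places that must acquire every partition lock at once.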
On 09.07.2020 18:14, Tom Lane wrote:
As I understood the report, it was not "things completely fall over",
it was "performance gets bad". But let's get real. Unless the OP
has a machine with thousands of CPUs, trying to run this way is
counterproductive.
Sorry that I was not clear. It actually is a case where "things completely
fall over".
If query planning takes several minutes, so that user response time
increases from seconds to hours,
then the system becomes unusable, doesn't it?
Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them. Right
now I feel no hesitation about saying "if it hurts, don't do that".
Unfortunately, we do not have to wait a decade or two.
Postgres already faces multiple problems on existing multiprocessor
systems (64, 96, ... cores).
And it is not even necessary to initiate thousands of connections: it is
enough to load all these cores and let them compete for some
resource (an LWLock, a buffer, ...). Even standard pgbench/YCSB benchmarks
with a Zipfian distribution can illustrate these problems.
There have been many proposed patches which help to improve this situation.
But since these patches increase performance only on huge servers
with a large number of cores, and show almost no
improvement (or even some degradation) on standard 4-core desktops,
almost none of them were committed.
Consequently, our customers have a lot of trouble trying to replace
Oracle with Postgres and achieve the same performance on the same
(quite good and expensive) hardware.
Hi Konstantin, a silly question: do you consider the workload you have as
well-optimized? Can it be optimized further? Reading this thread I have a
strong feeling that a very basic set of regular optimization actions is
missing here (or not explained): query analysis and optimization based on
pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the
state of the server in general, resource consumption, etc.
Do you have some monitoring that covers pg_stat_statements?
Before looking under the hood, I would use multiple pg_stat_statements
snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to
understand the workload and identify the heaviest queries -- first of all,
in terms of total_time, calls, shared buffers reads/hits, temporary files
generation. Which query groups are Top-N in each category, have you looked
at it?
You mentioned some crazy numbers for the planning time, but why not
analyze the picture holistically and see the overall numbers? For those
queries that have increased planning time, what is their share of total_time
in the overall picture, in %? (Unfortunately, we cannot see Top-N by planning time
in pg_stat_statements till PG13, but it doesn't mean that we cannot have
some good understanding of overall picture today, it just requires more
work).
If workload analysis & optimization was done holistically already, or not
possible due to some reason — pardon me. But if not and if your primary
goal is to improve this particular setup ASAP, then the topic could be
started in the -performance mailing list first, discussing the workload and
its aspects, and only after it's done, raised in -hackers. No?
On Thu, Jul 9, 2020 at 8:57 AM Konstantin Knizhnik <
k.knizhnik@postgrespro.ru> wrote:
Show quoted text
Hi Stephen,
Thank you for supporting an opinion that it is the problems not only of
client system design (I agree it is not so good
idea to have thousands tables and thousands active backends) but also of
Postgres.We have made more investigation and found out one more problem in
Postgres causing "invalidation storm".
There are some log living transactions which prevent autovacuum to do it
work and remove dead tuples.
So autovacuum is started once and once again and each time did no
progress but updated statistics and so sent invalidation messages.
autovacuum_naptime was set to 30 seconds, so each 30 seconds autovacuum
proceed huge number of tables and initiated large number of invalidation
messages which quite soon cause overflow of validation message buffers
for backends performing long OLAP queries.It makes me think about two possible optimizations:
1. Provide separate invalidation messages for relation metadata and its
statistic.
So update of statistic should not invalidate relation cache.
The main problem with this proposal is that pg_class contains relpages
and reltuples columns which conceptually are \ part of relation statistic
but stored in relation cache. If relation statistic is updated, then
most likely this fields are also changed. So we have to remove this
relation
from relation cache in any case.2. Remember in relation info XID of oldest active transaction at the
moment of last autovacuum.
At next autovacuum iteration we first of all compare this stored XID
with current oldest active transaction XID
and bypass vacuuming this relation if XID is not changed.Thoughts?
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems. I'm pretty sure we all agree that using a
connection pooler is recommended, but if there's things we can do to
make the system work at least a bit better when folks do use lots of
connections, provided we don't materially damage other cases, that's
probably worthwhile.

I also think that Postgres performance should degrade gracefully with an increasing number of active backends. Actually, further investigation of this particular case shows that such a large number of database connections was caused by ... the Postgres slowdown itself.
During normal operation the number of active backends is a few hundred.
But the "invalidation storm" caused queries to hang, so the user application initiated more and more new connections to perform the required actions.
Yes, this may not be the best application behavior in this case. At least it should first terminate the current connection using pg_terminate_backend. I just want to point out that the large number of backends was not the root cause of the problem.

Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?

I think the optimal number of lock partitions should depend not on the number of connections but on the number of available CPU cores, and thus on the expected level of concurrency.
It is hard to propose a portable way to obtain this number.
This is why I think a GUC is the better solution.
Certainly I realize that it is a very dangerous parameter which should be changed with special care, and not only because of MAX_SIMUL_LWLOCKS.

There are a few places in Postgres where it tries to lock all partitions
(deadlock detector, logical replication, ...).
If there are thousands of partitions, then such a lock becomes too expensive and we get yet another popular Postgres problem: a "deadlock detection storm", when due to high contention between backends a lock cannot be obtained within deadlock_timeout, and so deadlock detection is initiated. Simultaneous deadlock detection performed by all backends (which try to take ALL partition locks) paralyzes the system (TPS falls to 0).
A proposed patch for this problem was also rejected (once again, the problem can be reproduced only on a powerful server with a large number of cores).
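The "lock all partitions" pattern behind that storm can be modeled with a toy sketch. The names and the lock type here are illustrative (the real logic lives in deadlock.c and lwlock.c); the point is that the cost, and the number of simultaneously held LWLocks, grows linearly with the partition count.

```c
/* Toy model of a code path (e.g. the deadlock detector) that must hold
 * every lock partition at once.  Names are illustrative, not Postgres API. */
#define NUM_LOCK_PARTITIONS 16

typedef struct { int holders; } ToyLWLock;  /* toy lock: holder count only */

static ToyLWLock lock_partitions[NUM_LOCK_PARTITIONS];
static int held_count;  /* models the held_lwlocks[] bookkeeping that
                         * MAX_SIMUL_LWLOCKS bounds */

static void
lock_all_partitions(void)
{
    /* Always acquire in ascending order so concurrent callers cannot
     * deadlock against each other. */
    for (int i = 0; i < NUM_LOCK_PARTITIONS; i++)
    {
        lock_partitions[i].holders++;
        held_count++;
    }
}

static void
unlock_all_partitions(void)
{
    for (int i = NUM_LOCK_PARTITIONS - 1; i >= 0; i--)
    {
        lock_partitions[i].holders--;
        held_count--;
    }
}
```

This makes the MAX_SIMUL_LWLOCKS interaction visible: raising the partition count to 256 pushes held_count past the hardcoded limit of 200 the moment any such path runs.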
On Thu, Jul 9, 2020 at 6:57 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
2. Remember in the relation info the XID of the oldest active transaction at the moment of the last autovacuum.
At the next autovacuum iteration we first compare this stored XID with the XID of the current oldest active transaction, and bypass vacuuming this relation if the XID has not changed.
This option looks good to me independently of the use case under consideration. Long-running transactions are an old and well-known problem. If we can skip some useless work here, let's do this.
------
Regards,
Alexander Korotkov
Greetings,
* Konstantin Knizhnik (k.knizhnik@postgrespro.ru) wrote:
It makes me think about two possible optimizations:
1. Provide separate invalidation messages for relation metadata and its statistics, so that an update of statistics does not invalidate the relation cache.
The main problem with this proposal is that pg_class contains the relpages and reltuples columns, which conceptually are part of the relation statistics but are stored in the relation cache. If the relation statistics are updated, then most likely these fields are also changed, so we have to remove this relation from the relation cache in any case.
I realize this is likely to go over like a lead balloon, but the churn
in pg_class from updating reltuples/relpages has never seemed all that
great to me when just about everything else is so rarely changed, and
only through some user DDL action- and I agree that it seems like those
particular columns are more 'statistics' type of info and less info
about the definition of the relation. Other columns that do get changed
regularly are relfrozenxid and relminmxid. I wonder if it's possible to
move all of those elsewhere- perhaps some to the statistics tables as
you seem to be alluding to, and the others to $somewhereelse that is
dedicated to tracking that information which VACUUM is primarily
concerned with.
2. Remember in the relation info the XID of the oldest active transaction at the moment of the last autovacuum.
At the next autovacuum iteration we first compare this stored XID with the XID of the current oldest active transaction, and bypass vacuuming this relation if the XID has not changed.

Thoughts?
That sounds like an interesting optimization and I agree it'd be nice to
avoid the re-run of autovacuum when we can tell that there's not going
to be anything more we can do. As noted above, for my part, I think
it'd be nice to move that kind of ongoing maintenance/updates out of
pg_class, but just in general I agree with the idea to store that info
somewhere and wait until there's actually been progress in the global
xmin before re-running a vacuum on a table. If we can do that somewhere
outside of pg_class, I think that'd be better, but if no one is up for
that kind of a shake-up, then maybe we just put it in pg_class and deal
with the churn there.
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.

Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems. I'm pretty sure we all agree that using a
connection pooler is recommended, but if there's things we can do to
make the system work at least a bit better when folks do use lots of
connections, provided we don't materially damage other cases, that's
probably worthwhile.

I also think that Postgres performance should degrade gracefully with an increasing number of active backends. Actually, further investigation of this particular case shows that such a large number of database connections was caused by ... the Postgres slowdown itself.
During normal operation the number of active backends is a few hundred.
But the "invalidation storm" caused queries to hang, so the user application initiated more and more new connections to perform the required actions.
Yes, this may not be the best application behavior in this case. At least it should first terminate the current connection using pg_terminate_backend. I just want to point out that the large number of backends was not the root cause of the problem.
Yeah, this is all getting back to the fact that we don't have an
acceptance criterion or anything like that, where we'd actually hold off
on new connections/queries being allowed in while other things are
happening. Of course, a connection pooler would address this (and you
could use one and have it still look exactly like PG, if you use, say,
pgbouncer in session-pooling mode, but then you need to have the
application drop/reconnect and not do its own connection pooling..), but
it'd be nice to have something in core for this.
Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?

I think the optimal number of lock partitions should depend not on the number of connections but on the number of available CPU cores, and thus on the expected level of concurrency.
It is hard to propose a portable way to obtain this number.
This is why I think a GUC is the better solution.
A GUC for 'number of CPUs' doesn't seem like a bad option to have. How
to make that work well may be challenging though.
Certainly I realize that it is a very dangerous parameter which should be changed with special care, and not only because of MAX_SIMUL_LWLOCKS.
Sure.
There are a few places in Postgres where it tries to lock all partitions (deadlock detector, logical replication, ...).
If there are thousands of partitions, then such a lock becomes too expensive and we get yet another popular Postgres problem: a "deadlock detection storm", when due to high contention between backends a lock cannot be obtained within deadlock_timeout, and so deadlock detection is initiated. Simultaneous deadlock detection performed by all backends (which try to take ALL partition locks) paralyzes the system (TPS falls to 0).
A proposed patch for this problem was also rejected (once again, the problem can be reproduced only on a powerful server with a large number of cores).
That does sound like something that would be good to improve on, though
I haven't looked at the proposed patch or read the associated thread, so
I'm not sure I can really comment on its rejection specifically.
Thanks,
Stephen
On 09.07.2020 19:19, Nikolay Samokhvalov wrote:
Hi Konstantin, a silly question: do you consider the workload you have
as well-optimized? Can it be optimized further? Reading this thread I
have a strong feeling that a very basic set of regular optimization
actions is missing here (or not explained): query analysis and
optimization based on pg_stat_statements (and, maybe pg_stat_kcache),
some method to analyze the state of the server in general, resource
consumption, etc.

Do you have some monitoring that covers pg_stat_statements?
Before looking under the hood, I would use multiple pg_stat_statements
snapshots (can be analyzed using, say, postgres-checkup or pgCenter)
to understand the workload and identify the heaviest queries -- first
of all, in terms of total_time, calls, shared buffers reads/hits,
temporary files generation. Which query groups are the Top-N in each category? Have you looked at that?

You mentioned some crazy numbers for the planning time, but why not analyze the picture holistically and see the overall numbers? For those queries with increased planning time, what is their share of total_time in the overall picture, in %? (Unfortunately, we cannot
see Top-N by planning time in pg_stat_statements till PG13, but it
doesn't mean that we cannot have a good understanding of the overall picture today; it just requires more work).

If workload analysis & optimization was done holistically already, or
not possible due to some reason — pardon me. But if not and if your
primary goal is to improve this particular setup ASAP, then the topic
could be started in the -performance mailing list first, discussing
the workload and its aspects, and only after it's done, raised in
-hackers. No?
Certainly, both we and the customer have done workload analysis & optimization.
This is not a problem of particular queries, bad plans, resource exhaustion, ...
Unfortunately there are many scenarios in which Postgres demonstrates not a gradual degradation of performance with increasing workload, but a "snow avalanche", when negative feedback causes very fast paralysis of the system.
This case is just one of these scenarios. It is hard to say for sure what triggers the avalanche... A long-living transaction, a huge number of tables, aggressive autovacuum settings... But there is a cascade of negative events which causes a system that normally functions for months to stop working at all.
In this particular case we have the following chain:
- a long-living transaction causes autovacuum to send a lot of invalidation messages;
- these messages overflow the invalidation message queues, forcing backends to invalidate their caches and reload from the catalog;
- the too-small fastpath lock cache causes many concurrent accesses to the shared lock hash;
- contention for the LW-locks, caused by the small number of lock partitions, causes starvation.
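The queue-overflow step in that chain can be sketched with a simplified model of the shared invalidation ring buffer. The real logic in sinvaladt.c is more involved (segments, per-backend ProcStates, catchup signals); the field names here are illustrative.

```c
#include <stdbool.h>

#define MAXNUMMESSAGES 4096   /* hardcoded buffer size, as in sinvaladt.c */

/* Simplified combined view of the shared queue and one lagging backend. */
typedef struct SInvalState
{
    int  maxMsgNum;   /* total messages ever written to the shared queue */
    int  nextMsgNum;  /* this backend's read position */
    bool resetState;  /* set when the backend fell too far behind */
} SInvalState;

/* A producer (e.g. an autovacuum worker) appends one message. */
static void
add_message(SInvalState *s)
{
    s->maxMsgNum++;
    /*
     * If a backend lags by more than MAXNUMMESSAGES, its unread messages
     * have been overwritten: force it to discard ALL cached catalog data.
     */
    if (s->maxMsgNum - s->nextMsgNum > MAXNUMMESSAGES)
        s->resetState = true;
}

/* Returns true if the backend must rebuild its caches from the catalog. */
static bool
consume_messages(SInvalState *s)
{
    if (s->resetState)
    {
        s->nextMsgNum = s->maxMsgNum;
        s->resetState = false;
        return true;            /* full cache reset */
    }
    s->nextMsgNum = s->maxMsgNum;
    return false;               /* processed incrementally */
}
```

With ~5000 tables vacuumed within one naptime, any backend busy in a long query easily lags by more than 4096 messages, so every such backend takes the full-reset path at roughly the same time.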
Great idea.
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction, when there are no IN_PROGRESS transactions running at the RR or S level (which is a very common case, because RC is the default level and is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it has always puzzled me that autovacuum has a global XID "horizon".
With such an optimization, the "hot_standby_feedback=on" mode could also be implemented more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for the RC and RR/S levels.
Without this, we cannot have good performance for HTAP use cases in Postgres: the presence of just a small number of long-running transactions, indeed, is known to kill the performance of OLTP workloads quickly, and it leads to much faster bloat growth than necessary.
However, maybe I'm wrong in these considerations, or it's impossible / too
difficult to implement.
On Thu, Jul 9, 2020 at 9:38 AM Alexander Korotkov <aekorotkov@gmail.com>
wrote:
On Thu, Jul 9, 2020 at 6:57 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

2. Remember in the relation info the XID of the oldest active transaction at the moment of the last autovacuum.
At the next autovacuum iteration we first compare this stored XID with the XID of the current oldest active transaction, and bypass vacuuming this relation if the XID has not changed.

This option looks good to me independently of the use case under consideration. Long-running transactions are an old and well-known problem. If we can skip some useless work here, let's do this.

------
Regards,
Alexander Korotkov
Greetings,
We generally prefer that you don't top-post on these lists.
* Nikolay Samokhvalov (samokhvalov@gmail.com) wrote:
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction, when there are no IN_PROGRESS transactions running at the RR or S level (which is a very common case, because RC is the default level and is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it has always puzzled me that autovacuum has a global XID "horizon".
Yeah, I've had thoughts along the same lines, though I had some ideas
that we could actually manage to support it even with RR (at least...
not sure about serializable) by considering what tuples the transactions
in the system could actually see (eg: even with RR, a tuple created
after that transaction started and was then deleted wouldn't ever be
able to be seen and therefore could be cleaned up..). A further thought
on that was to only spend that kind of effort once a tuple had aged a
certain amount, though it depends a great deal on exactly what would
need to be done for this.
Unfortunately, anything in this area is likely to carry a good bit of
risk associated with it as VACUUM doing the wrong thing would be quite
bad.
Thanks,
Stephen
On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.
To avoid division, we can specify the log2 of these values, so a shift can be used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
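The log2/shift idea can be sketched like this. The GUC variable and helper names are hypothetical (in current sources these are compile-time macros in lwlock.h), and the two stand-in constants below are illustrative values, not the real ones:

```c
#include <stdint.h>

/* Hypothetical GUC, set at server start; log2 form avoids runtime division. */
static int log2_num_lock_partitions = 4;     /* default: 16 partitions */

static inline uint32_t
num_lock_partitions(void)
{
    return (uint32_t) 1 << log2_num_lock_partitions;
}

/*
 * Map a lock tag's hash code to a partition.  Because the partition count
 * is a power of two, "hash % n" reduces to a cheap bit mask.
 */
static inline uint32_t
lock_hash_partition(uint32_t hashcode)
{
    return hashcode & (num_lock_partitions() - 1);
}

/*
 * MAX_SIMUL_LWLOCKS derived from the partition count, so that code paths
 * holding every partition lock at once (deadlock detector, etc.) still fit.
 * The two constants below are placeholders for the real values.
 */
#define NUM_INDIVIDUAL_LWLOCKS 52
#define NAMED_LWLOCK_RESERVE   4

static inline uint32_t
max_simul_lwlocks(void)
{
    return num_lock_partitions() + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE;
}
```

This way, raising log2_num_lock_partitions to 8 automatically grows the simultaneous-LWLock budget past the 256 partition locks, avoiding the "too many LWLocks taken" failure described earlier in the thread.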
Because I was involved in this particular case and don't want it to become a habit, I'm volunteering to test whatever patch this discussion produces.
--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jul 9, 2020 at 10:00 PM Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction, when there are no IN_PROGRESS transactions running at the RR or S level (which is a very common case, because RC is the default level and is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it has always puzzled me that autovacuum has a global XID "horizon".
With such an optimization, the "hot_standby_feedback=on" mode could also be implemented more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for the RC and RR/S levels.
Yes, the current way of calculating dead tuples is lossy, because we rely only on the oldest xid. However, if we kept the oldest snapshot instead of the oldest xmin, long-running transactions wouldn't be such a disaster. I don't think this is feasible with the current snapshot model, because keeping full snapshots instead of just xmins would bloat shared-memory structs and complicate computations. But CSN could certainly support this optimization.
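The difference between the two horizons can be illustrated with a toy visibility check. This is a sketch, not Postgres's actual HeapTupleSatisfiesVacuum logic, and it assumes the simplest case: one old long-running snapshot, all other snapshots taken after the deleting transaction completed, and no in-progress XIDs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct Tuple
{
    TransactionId xmin;  /* creating transaction */
    TransactionId xmax;  /* deleting transaction (0 = live) */
} Tuple;

/* Toy snapshot: only XIDs below the horizon are visible to it. */
typedef struct Snapshot
{
    TransactionId xmin_horizon;
} Snapshot;

/* Conservative rule used today: dead only if deleted before the oldest xmin. */
static bool
dead_by_oldest_xmin(const Tuple *t, TransactionId oldest_xmin)
{
    return t->xmax != 0 && t->xmax < oldest_xmin;
}

/*
 * Snapshot-based rule: a tuple both created and deleted after the oldest
 * snapshot began can never be seen by that snapshot, and (under the stated
 * assumption) every other snapshot postdates the delete, so the tuple is
 * reclaimable even though its xmax is above the oldest xmin.
 */
static bool
dead_by_oldest_snapshot(const Tuple *t, const Snapshot *oldest_snap,
                        TransactionId newest_completed_xid)
{
    if (t->xmax == 0)
        return false;
    return t->xmin >= oldest_snap->xmin_horizon &&
           t->xmax <= newest_completed_xid;
}
```

The first rule retains every tuple deleted after the long transaction started; the second reclaims the churn that transaction could never observe, which is exactly what keeping the oldest snapshot (or CSN) would buy.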
------
Regards,
Alexander Korotkov
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Unfortunately we do not have to wait for a decade or two.
Postgres already faces multiple problems on existing multiprocessor systems (64, 96, ... cores).
And it is not even necessary to initiate thousands of connections: it is enough to load all these cores and let them compete for some resource (an LW-lock, a buffer, ...). Even standard pgbench/YCSB benchmarks with a zipfian distribution can illustrate these problems.
I concur with you. VMs and bare metal machines with 100-200 CPU cores and TBs of RAM are already available even on public clouds. Users easily set max_connections to a high value like 10,000, create thousands or tens of thousands of relations, and expect things to go smoothly. Although that may be a horror for PG developers who know the internals well, Postgres has grown into a great database to be relied upon.
Besides, I don't want people to think like "Postgres cannot scale up on one machine, so we need scale-out." I understand some form of scale-out is necessary for large-scale analytics and web-scale multitenant OLTP, but it would be desirable to be able to cover the OLTP workloads for one organization/region with the advances in hardware and Postgres leveraging those advances, without something like Oracle RAC.
There were many proposed patches which help to improve this situation.
But because these patches increase performance only on huge servers with a large number of cores, and show almost no improvement (or even some degradation) on standard 4-core desktops, almost none of them were committed.
Consequently our customers have a lot of trouble trying to replace Oracle with Postgres and get the same performance on the same (quite good and expensive) hardware.
Yeah, it's a pity that the shiny-looking patches from Postgres Pro (mostly from Konstantin-san?) -- autoprepare, built-in connection pooling, fair lwlock, and the revolutionary multi-threaded backend -- haven't gained much attention.
Regards
Takayuki Tsunakawa