Postgres is not able to handle more than 4k tables!?
I want to describe one bad situation we have encountered with one of our customers.
There are ~5000 tables in their database. And what is worse, most of them are actively used.
Several flaws in Postgres then bring their system almost to a standstill.
Autovacuum periodically processes all these 5k relations (because they are actively updated).
And since most of these tables are small, autovacuum finishes processing them at almost the same time.
As a result, autovacuum workers produce ~5k invalidation messages in a short period of time.
There are several thousand clients, most of which are executing complex queries.
So they are not able to process all these invalidation messages, and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be changed without recompiling Postgres.
This is problem number 1.
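For reference, this limit sits right next to the other sinval-queue constants; roughly (from src/backend/storage/ipc/sinvaladt.c, comment abridged, exact text may differ by version):

/* MAXNUMMESSAGES: max number of shared-invalidation messages we can buffer;
 * must be a power of 2; hardcoded, there is no GUC to change it */
#define MAXNUMMESSAGES 4096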
As a result, resetState is set to true, forcing backends to invalidate their caches.
So most backends lose their cached metadata and have to access the system catalog to reload it.
But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage of some queries increases from milliseconds to several minutes!
Koda!
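For reference, the lock manager maps every locktag to one of these 16 partitions by hashing, so with ~25k relations each partition LWLock ends up shared by well over a thousand relations; roughly (from src/include/storage/lock.h, exact form may differ between versions):

#define LockHashPartition(hashcode) \
    ((hashcode) % NUM_LOCK_PARTITIONS)
#define LockHashPartitionLock(hashcode) \
    (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + \
                      LockHashPartition(hashcode)].lock)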
This is problem number 2. But there is one more flaw we have faced.
We have increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.
But there are several places in Postgres where it tries to hold all partition locks (for example, in the deadlock detector).
Clearly, if NUM_LOCK_PARTITIONS > MAX_SIMUL_LWLOCKS, we get this error.
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.
To avoid division, we can specify the log2 of these values, so a shift can be used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
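A rough sketch of what this proposal could look like (purely hypothetical code, names invented here for illustration, not a worked-out patch):

/* hypothetical sketch only -- not committed code */
int     log2_num_lock_partitions = 4;   /* GUC, PGC_POSTMASTER; default = 16 partitions */

#define NUM_LOCK_PARTITIONS   (1 << log2_num_lock_partitions)
/* power of two, so the partition can be picked with a mask instead of a division */
#define LockHashPartition(hashcode) \
    ((hashcode) & (NUM_LOCK_PARTITIONS - 1))
/* keep the per-backend LWLock tracking array large enough for all partitions;
 * note that held_lwlocks[] would then have to be sized at run time */
#define MAX_SIMUL_LWLOCKS \
    (NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE)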
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
There are several thousand clients, most of which are executing complex
queries.
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
So they are not able to process all these invalidation messages, and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be changed without recompiling Postgres.
This is problem number 1.
No, this isn't a problem. Or at least you haven't shown a reason to
think it is. Sinval overruns are somewhat routine, and we certainly
test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage of some queries increases from milliseconds to several minutes!
Really?
This is problem number 2. But there is one more flaw we have faced.
We have increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.
Seems like self-inflicted damage. I certainly don't recall anyplace
in the docs where we suggest that you can alter that constant without
worrying about consequences.
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.
I seriously doubt we'd do that.
regards, tom lane
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Autovacuum periodically processes all these 5k relations (because they are actively updated).
And since most of these tables are small, autovacuum finishes processing them at almost the same time.
As a result, autovacuum workers produce ~5k invalidation messages in a short period of time.
How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)? It's available since PG 12. It causes autovacuum to not truncate the relation. It's the relation truncation that produces those shared invalidation messages.
But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage of some queries increases from milliseconds to several minutes!
Koda!
The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter. It acquires an Access Exclusive lock on the relation. Without the strong Access Exclusive lock, DML statements just use fast-path locking, which doesn't acquire the lock manager partition lock.
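To be more specific, the reason a strong lock defeats the fast path is that each strong locker registers itself in a small shared array of counters hashed by locktag, and while the counter for a given slot is non-zero other backends skip the fast path and go to the shared lock table. Roughly (a sketch based on lock.c; details may differ by version):

/* strong lockers (e.g. the Access Exclusive lock from truncation) bump a counter here */
#define FAST_PATH_STRONG_LOCK_HASH_PARTITIONS 1024
typedef struct
{
    slock_t     mutex;
    uint32      count[FAST_PATH_STRONG_LOCK_HASH_PARTITIONS];
} FastPathStrongRelationLockData;
/* ...and LockAcquireExtended uses the per-backend fast path only while the
 * counter for the locktag's slot is zero; otherwise it goes to the shared
 * lock table under the partition LWLock. */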
The long LWLock wait is a sad story. The victims are probably the exclusive lockers. When someone holds a shared lock on an LWLock, an exclusive locker has to wait. That's OK. However, if another shared locker comes later, it acquires the LWLock even though there are waiting exclusive lockers. That's unfair, but this is the community decision.
Regards
Takayuki Tsunakawa
On 09.07.2020 03:49, tsunakawa.takay@fujitsu.com wrote:
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Autovacuum periodically processes all these 5k relations (because they are actively updated).
And since most of these tables are small, autovacuum finishes processing them at almost the same time.
As a result, autovacuum workers produce ~5k invalidation messages in a short period of time.

How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)? It's available since PG 12. It causes autovacuum to not truncate the relation. It's the relation truncation that produces those shared invalidation messages.
Invalidation messages are also caused by statistics updates:
#0 0x000055a85f4f5fd6 in RegisterCatcacheInvalidation (cacheId=49,
hashValue=715727843, dbId=12443)
at inval.c:483
#1 0x000055a85f4f4dc2 in PrepareToInvalidateCacheTuple
(relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70,
newtuple=0x7f4598e75ef8, function=0x55a85f4f5fc0
<RegisterCatcacheInvalidation>) at catcache.c:1830
#2 0x000055a85f4f6b21 in CacheInvalidateHeapTuple
(relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70,
newtuple=0x7f4598e75ef8) at inval.c:1149
#3 0x000055a85f016372 in heap_update (relation=0x7f45a34ce5a0,
otid=0x7f4598e75efc,
newtup=0x7f4598e75ef8, cid=0, crosscheck=0x0, wait=1 '\001',
hufd=0x7ffc75bebcf0,
lockmode=0x7ffc75bebce8) at heapam.c:4245
#4 0x000055a85f016f98 in simple_heap_update (relation=0x7f45a34ce5a0,
otid=0x7f4598e75efc,
tup=0x7f4598e75ef8) at heapam.c:4490
#5 0x000055a85f153ec5 in update_attstats (relid=16384, inh=0 '\000',
natts=1, vacattrstats=0x55a860f0fba0)
at analyze.c:1619
#6 0x000055a85f151898 in do_analyze_rel (onerel=0x7f45a3480080,
options=98, params=0x55a860f0f028,
va_cols=0x0, acquirefunc=0x55a85f15264e <acquire_sample_rows>,
relpages=26549, inh=0 '\000',
in_outer_xact=0 '\000', elevel=13) at analyze.c:562
#7 0x000055a85f150be1 in analyze_rel (relid=16384,
relation=0x7ffc75bec370, options=98,
params=0x55a860f0f028, va_cols=0x0, in_outer_xact=0 '\000',
bstrategy=0x55a860f0f0b8) at analyze.c:257
#8 0x000055a85f1e1589 in vacuum (options=98, relation=0x7ffc75bec370,
relid=16384, params=0x55a860f0f028,
va_cols=0x0, bstrategy=0x55a860f0f0b8, isTopLevel=1 '\001') at
vacuum.c:320
#9 0x000055a85f2fd92a in autovacuum_do_vac_analyze (tab=0x55a860f0f020,
bstrategy=0x55a860f0f0b8)
at autovacuum.c:2874
#10 0x000055a85f2fcccb in do_autovacuum () at autovacuum.c:2374
But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage of some queries increases from milliseconds to several minutes!
Koda!

The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter. It acquires an Access Exclusive lock on the relation. Without the strong Access Exclusive lock, DML statements just use fast-path locking, which doesn't acquire the lock manager partition lock.
Looks like it is not true (at least for PG 9.6):
#0 0x00007fa6d30da087 in semop () from /lib64/libc.so.6
#1 0x0000000000682241 in PGSemaphoreLock
(sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
#2 0x00000000006ec6eb in LWLockAcquire (lock=lock@entry=0x7f23b544f800,
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
#3 0x00000000006e5560 in LockAcquireExtended
(locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1,
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0
'\000', reportMemoryError=reportMemoryError@entry=1 '\001',
locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
#4 0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1)
at lmgr.c:113
#5 0x00000000004a9f55 in relation_open (relationId=87103837,
lockmode=lockmode@entry=1) at heapam.c:1131
#6 0x00000000004bdc66 in index_open (relationId=<optimized out>,
lockmode=lockmode@entry=1) at indexam.c:151
#7 0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758,
relationObjectId=72079078, inhparent=<optimized out>,
rel=rel@entry=0x3a2d460) at plancat.c:183
#8 0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758,
relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148
Please notice lockmode=1 (AccessShareLock)
The long LWLock wait is a sad story. The victims are probably the exclusive lockers. When someone holds a shared lock on an LWLock, an exclusive locker has to wait. That's OK. However, if another shared locker comes later, it acquires the LWLock even though there are waiting exclusive lockers. That's unfair, but this is the community decision.
Yes, I also think that this is the reason for the problem.
Alexander Korotkov has implemented fair LWLocks, which eliminate this kind of problem in some scenarios.
Maybe it can also help here.
On 09.07.2020 00:35, Tom Lane wrote:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
There are several thousand clients, most of which are executing complex
queries.

So, that's really the core of your problem. We don't promise that you can run several thousand backends at once. Usually it's recommended that you stick a connection pooler in front of a server with (at most) a few hundred backends.
It is not my problem - it is the customer's problem.
Certainly, the advice to use a connection pooler was the first thing we proposed to the customer when we saw such a large number of active backends.
Unfortunately, it is not always possible (a connection pooler does not preserve session semantics).
This is why I have proposed a built-in connection pooler for Postgres.
But that is a different story.
So they are not able to process all these invalidation messages, and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be changed without recompiling Postgres.
This is problem number 1.

No, this isn't a problem. Or at least you haven't shown a reason to think it is. Sinval overruns are somewhat routine, and we certainly test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
Certainly, a cache overrun is not fatal.
But if most of the backends are blocked in heap_open of the pg_attribute relation, then something is not OK with Postgres, isn't it?
It causes huge LWLock acquisition times in heap_open, and the planning stage of some queries increases from milliseconds to several minutes!
Really?
Planning time: 75698.602 ms
Execution time: 0.861 ms
This is problem number 2. But there is one more flaw we have faced.
We have increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.

Seems like self-inflicted damage. I certainly don't recall anyplace in the docs where we suggest that you can alter that constant without worrying about consequences.
It looks like you are trying to convince me that this practice of hardcoding constants in the code, without taking the relations between them into account, is a good design pattern?
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.

I seriously doubt we'd do that.
It's a pity, because such an attitude is one of the reasons why Postgres is a pgbench-oriented database showing good results on notebooks but not on real systems running on powerful servers (NUMA, SSD, huge amounts of memory, large numbers of cores, ...).
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Looks like it is not true (at least for PG 9.6):
#0 0x00007fa6d30da087 in semop () from /lib64/libc.so.6
#1 0x0000000000682241 in PGSemaphoreLock
(sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
#2 0x00000000006ec6eb in LWLockAcquire
(lock=lock@entry=0x7f23b544f800,
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
#3 0x00000000006e5560 in LockAcquireExtended
(locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1,
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0
'\000', reportMemoryError=reportMemoryError@entry=1 '\001',
locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
#4 0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1)
at lmgr.c:113
#5 0x00000000004a9f55 in relation_open (relationId=87103837,
lockmode=lockmode@entry=1) at heapam.c:1131
#6 0x00000000004bdc66 in index_open (relationId=<optimized out>,
lockmode=lockmode@entry=1) at indexam.c:151
#7 0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758,
relationObjectId=72079078, inhparent=<optimized out>,
rel=rel@entry=0x3a2d460) at plancat.c:183
#8 0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758,
relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148

Please notice lockmode=1 (AccessShareLock)
Ouch, there exists another sad hardcoded value: the maximum number of locks that can be acquired by the fast-path mechanism.
[LockAcquireExtended]
/*
* Attempt to take lock via fast path, if eligible. But if we remember
* having filled up the fast path array, we don't attempt to make any
* further use of it until we release some locks. It's possible that some
* other backend has transferred some of those locks to the shared hash
* table, leaving space free, but it's not worth acquiring the LWLock just
* to check. It's also possible that we're acquiring a second or third
* lock type on a relation we have already locked using the fast-path, but
* for now we don't worry about that case either.
*/
if (EligibleForRelationFastPath(locktag, lockmode) &&
FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
{
/*
* We allow a small number of "weak" relation locks (AccessShareLock,
* RowShareLock, RowExclusiveLock) to be recorded in the PGPROC structure
* rather than the main lock table. This eases contention on the lock
* manager LWLocks. See storage/lmgr/README for additional details.
*/
#define FP_LOCK_SLOTS_PER_BACKEND 16
16 looks easily exceeded even in an OLTP transaction that is not particularly long... especially if the table is partitioned. I wonder if we're caught in the hell of lock manager partition lock contention without knowing it. I'm afraid other pitfalls are lurking when there are many relations.
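For context, once those 16 slots are filled (or the requested lock mode is too strong for the fast path), LockAcquireExtended falls back to the shared lock table, which is exactly the exclusive partition-LWLock acquisition visible in the backtrace above. Roughly (a sketch of that code path in lock.c; details differ between versions):

    /* fast path not usable: fall back to the shared lock table */
    partitionLock = LockHashPartitionLock(hashcode);

    LWLockAcquire(partitionLock, LW_EXCLUSIVE);
    /* ... find or create the LOCK and PROCLOCK entries, then either grant
     * the lock immediately, or release the partition lock and sleep in
     * WaitOnLock() until it is granted ... */
    LWLockRelease(partitionLock);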
Regards
Takayuki Tsunakawa
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
There are several thousand clients, most of which are executing complex
queries.

So, that's really the core of your problem. We don't promise that you can run several thousand backends at once. Usually it's recommended that you stick a connection pooler in front of a server with (at most) a few hundred backends.
Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems. I'm pretty sure we all agree that using a
connection pooler is recommended, but if there's things we can do to
make the system work at least a bit better when folks do use lots of
connections, provided we don't materially damage other cases, that's
probably worthwhile.
So they are not able to process all these invalidation messages, and their invalidation message buffer overflows.
The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be changed without recompiling Postgres.
This is problem number 1.

No, this isn't a problem. Or at least you haven't shown a reason to think it is. Sinval overruns are somewhat routine, and we certainly test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
Testing that it doesn't outright break and having it be decently
performant are two rather different things. I think we're talking more
about performance and not so much about if the system is outright broken
in this case.
But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
Having just 16 LWLocks greatly increases the probability of conflicts (taking into account that there are 5k tables and about 25k relations in total).
It causes huge LWLock acquisition times in heap_open, and the planning stage of some queries increases from milliseconds to several minutes!

Really?
Apparently, given the response down-thread.
This is problem number 2. But there is one more flaw we have faced.
We have increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients started to report a "too many LWLocks taken" error.
There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200,
whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere.

Seems like self-inflicted damage. I certainly don't recall anyplace in the docs where we suggest that you can alter that constant without worrying about consequences.
Perhaps not in the docs, but it would be good to make note of it somewhere, as I don't think it's really appropriate to assume these constants won't ever change, and whoever contemplates changing them would appreciate knowing about other related values.
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.

I seriously doubt we'd do that.
Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?
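One hypothetical way to do that (the function name and scaling factors below are invented purely for illustration, not an actual proposal in the tree) would be to derive the partition count from max_connections at postmaster start and round it up to a power of two:

/* hypothetical sketch: derive the number of lock partitions from
 * max_connections at postmaster start (scaling factor is arbitrary) */
static int
compute_lock_partitions(int max_connections)
{
    int     target = max_connections / 16;
    int     log2 = 4;               /* never go below today's 16 partitions */

    if (target > 256)
        target = 256;               /* arbitrary upper bound */
    while ((1 << log2) < target)
        log2++;
    return 1 << log2;               /* always a power of two */
}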
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.
Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems.
As I understood the report, it was not "things completely fall over",
it was "performance gets bad". But let's get real. Unless the OP
has a machine with thousands of CPUs, trying to run this way is
counterproductive.
Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them. Right
now I feel no hesitation about saying "if it hurts, don't do that".
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.

Sure, but that doesn't mean things should completely fall over when we do get up to larger numbers of backends, which is definitely pretty common in larger systems.

As I understood the report, it was not "things completely fall over", it was "performance gets bad". But let's get real. Unless the OP has a machine with thousands of CPUs, trying to run this way is counterproductive.
Right, the issue is that performance gets bad (or, really, more like
terrible...), and regardless of if it's ideal or not, lots of folks
actually do run PG with thousands of connections, and we know that at
start-up time because they've set max_connections to a sufficiently high
value to support doing exactly that.
Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them. Right
now I feel no hesitation about saying "if it hurts, don't do that".
I disagree that we should completely ignore these use-cases.
Thanks,
Stephen
Hi Stephen,
Thank you for supporting the opinion that these are problems not only of the client system design (I agree it is not such a good idea to have thousands of tables and thousands of active backends) but also of Postgres.
We have done more investigation and found one more problem in Postgres causing an "invalidation storm".
There are some long-living transactions which prevent autovacuum from doing its work and removing dead tuples.
So autovacuum is started again and again, and each time it makes no progress but updates statistics and so sends invalidation messages.
autovacuum_naptime was set to 30 seconds, so every 30 seconds autovacuum processed a huge number of tables and initiated a large number of invalidation messages, which quite soon caused an overflow of the invalidation message buffers of backends performing long OLAP queries.
It makes me think about two possible optimizations:
1. Provide separate invalidation messages for relation metadata and its statistics,
so that a statistics update does not invalidate the relation cache.
The main problem with this proposal is that pg_class contains the relpages and reltuples columns, which conceptually are part of the relation statistics but are stored in the relation cache. If the relation statistics are updated, then most likely these fields are also changed, so we have to remove the relation from the relation cache in any case.
2. Remember, in the relation info, the XID of the oldest active transaction at the moment of the last autovacuum.
At the next autovacuum iteration we first compare this stored XID with the current oldest active transaction XID and bypass vacuuming this relation if the XID has not changed.
Thoughts?
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.

Sure, but that doesn't mean things should completely fall over when we do get up to larger numbers of backends, which is definitely pretty common in larger systems. I'm pretty sure we all agree that using a connection pooler is recommended, but if there's things we can do to make the system work at least a bit better when folks do use lots of connections, provided we don't materially damage other cases, that's probably worthwhile.
I also think that Postgres performance should degrade gradually with an increasing number of active backends. Actually, further investigation of this particular case shows that such a large number of database connections was caused by ... the Postgres slowdown.
During the normal workflow the number of active backends is a few hundred.
But the "invalidation storm" caused queries to hang, so the user application had to initiate more and more new connections to perform the required actions.
Yes, this may not be the best behavior of the application in this case. At least it should first terminate the current connection using pg_terminate_backend. I just want to point out that the large number of backends was not the core of the problem.
Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?
I think that the optimal number of lock partitions should depend not on the number of connections but on the number of available CPU cores, and so on the expected level of concurrency.
It is hard to propose a portable way to obtain this number.
This is why I think that a GUC is the better solution.
Certainly, I realize that it is a very dangerous parameter which should be changed with special care.
Not only because of MAX_SIMUL_LWLOCKS.
There are a few places in Postgres where it tries to lock all partitions (the deadlock detector, logical replication, ...).
If there were thousands of partitions, then such a lock would be too expensive, and we would get yet another popular Postgres problem: the "deadlock detection storm", when, due to high contention between backends, a lock cannot be obtained within deadlock_timeout and so initiates deadlock detection. Simultaneous deadlock detection performed by all backends (which try to take ALL partition locks) paralyzes the system (TPS falls to 0).
The patch proposed for this problem was also rejected (once again, the problem can be reproduced only on a powerful server with a large number of cores).
On 09.07.2020 18:14, Tom Lane wrote:
As I understood the report, it was not "things completely fall over",
it was "performance gets bad". But let's get real. Unless the OP
has a machine with thousands of CPUs, trying to run this way is
counterproductive.
Sorry that I was not clear. It is actually a case where "things completely fall over".
If query planning takes several minutes, and so user response time increases from seconds to hours, then the system becomes unusable, doesn't it?
Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them. Right
now I feel no hesitation about saying "if it hurts, don't do that".
Unfortunately, we do not have to wait a decade or two.
Postgres already faces multiple problems on existing multiprocessor systems (64, 96, ... cores).
And it is not even necessary to initiate thousands of connections: it is enough to load all these cores and let them compete for some resource (an LWLock, a buffer, ...). Even standard pgbench/YCSB benchmarks with a zipfian distribution can illustrate these problems.
There were many proposed patches which help to improve this situation.
But since these patches increase performance only on huge servers with a large number of cores and show almost no improvement (or even some degradation) on standard 4-core desktops, almost none of them were committed.
Consequently, our customers have a lot of trouble trying to replace Oracle with Postgres and get the same performance on the same (quite good and expensive) hardware.
Hi Konstantin, a silly question: do you consider the workload you have as
well-optimized? Can it be optimized further? Reading this thread I have a
strong feeling that a very basic set of regular optimization actions is
missing here (or not explained): query analysis and optimization based on
pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the
state of the server in general, resource consumption, etc.
Do you have some monitoring that covers pg_stat_statements?
Before looking under the hood, I would use multiple pg_stat_statements
snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to
understand the workload and identify the heaviest queries -- first of all,
in terms of total_time, calls, shared buffers reads/hits, temporary files
generation. Which query groups are Top-N in each category, have you looked
at it?
You mentioned some crazy numbers for the planning time, but why not analyze the picture holistically and see the overall numbers? For those queries that have increased planning time, what is their share of total_time in the overall picture, in %? (Unfortunately, we cannot see the Top-N by planning time in pg_stat_statements till PG13, but that doesn't mean we cannot have a good understanding of the overall picture today; it just requires more work.)
If workload analysis & optimization was done holistically already, or not
possible due to some reason — pardon me. But if not and if your primary
goal is to improve this particular setup ASAP, then the topic could be
started in the -performance mailing list first, discussing the workload and
its aspects, and only after it's done, raised in -hackers. No?
On Thu, Jul 9, 2020 at 6:57 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
2. Remember, in the relation info, the XID of the oldest active transaction at the moment of the last autovacuum.
At the next autovacuum iteration we first compare this stored XID with the current oldest active transaction XID and bypass vacuuming this relation if the XID has not changed.

This option looks good to me independently of the use case under consideration. Long-running transactions are an old and well-known problem. If we can skip some useless work here, let's do it.
------
Regards,
Alexander Korotkov
Greetings,
* Konstantin Knizhnik (k.knizhnik@postgrespro.ru) wrote:
It makes me think about two possible optimizations:
1. Provide separate invalidation messages for relation metadata and its statistics,
so that a statistics update does not invalidate the relation cache.
The main problem with this proposal is that pg_class contains the relpages and reltuples columns, which conceptually are part of the relation statistics but are stored in the relation cache. If the relation statistics are updated, then most likely these fields are also changed, so we have to remove the relation from the relation cache in any case.
I realize this is likely to go over like a lead balloon, but the churn
in pg_class from updating reltuples/relpages has never seemed all that
great to me when just about everything else is so rarely changed, and
only through some user DDL action- and I agree that it seems like those
particular columns are more 'statistics' type of info and less info
about the definition of the relation. Other columns that do get changed
regularly are relfrozenxid and relminmxid. I wonder if it's possible to
move all of those elsewhere- perhaps some to the statistics tables as
you seem to be alluding to, and the others to $somewhereelse that is
dedicated to tracking that information which VACUUM is primarily
concerned with.
2. Remember, in the relation info, the XID of the oldest active transaction at the moment of the last autovacuum.
At the next autovacuum iteration we first compare this stored XID with the current oldest active transaction XID and bypass vacuuming this relation if the XID has not changed.
Thoughts?
That sounds like an interesting optimization and I agree it'd be nice to
avoid the re-run of autovacuum when we can tell that there's not going
to be anything more we can do. As noted above, for my part, I think
it'd be nice to move that kind of ongoing maintenance/updates out of
pg_class, but just in general I agree with the idea to store that info
somewhere and wait until there's actually been progress in the global
xmin before re-running a vacuum on a table. If we can do that somewhere
outside of pg_class, I think that'd be better, but if no one is up for
that kind of a shake-up, then maybe we just put it in pg_class and deal
with the churn there.
So, that's really the core of your problem. We don't promise that
you can run several thousand backends at once. Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.

Sure, but that doesn't mean things should completely fall over when we do get up to larger numbers of backends, which is definitely pretty common in larger systems. I'm pretty sure we all agree that using a connection pooler is recommended, but if there's things we can do to make the system work at least a bit better when folks do use lots of connections, provided we don't materially damage other cases, that's probably worthwhile.

I also think that Postgres performance should degrade gradually with an increasing number of active backends. Actually, further investigation of this particular case shows that such a large number of database connections was caused by ... the Postgres slowdown.
During the normal workflow the number of active backends is a few hundred.
But the "invalidation storm" caused queries to hang, so the user application had to initiate more and more new connections to perform the required actions.
Yes, this may not be the best behavior of the application in this case. At least it should first terminate the current connection using pg_terminate_backend. I just want to point out that the large number of backends was not the core of the problem.
Yeah, this is all getting back to the fact that we don't have an
acceptance criteria or anything like that, where we'd actually hold off
on new connections/queries being allowed in while other things are
happening. Of course, a connection pooler would address this (and you
could use one and have it still look exactly like PG, if you use, say,
pgbouncer in session-pooling mode, but then you need to have the
application drop/reconnect and not do its own connection pooling..), but
it'd be nice to have something in core for this.
Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?

I think that the optimal number of lock partitions should depend not on the number of connections but on the number of available CPU cores, and so on the expected level of concurrency.
It is hard to propose a portable way to obtain this number.
This is why I think that a GUC is the better solution.
A GUC for 'number of CPUs' doesn't seem like a bad option to have. How
to make that work well may be challenging though.
Certainly, I realize that it is a very dangerous parameter which should be changed with special care.
Not only because of MAX_SIMUL_LWLOCKS.
Sure.
There are a few places in Postgres where it tries to lock all partitions (the deadlock detector, logical replication, ...).
If there were thousands of partitions, then such a lock would be too expensive, and we would get yet another popular Postgres problem: the "deadlock detection storm", when, due to high contention between backends, a lock cannot be obtained within deadlock_timeout and so initiates deadlock detection. Simultaneous deadlock detection performed by all backends (which try to take ALL partition locks) paralyzes the system (TPS falls to 0).
The patch proposed for this problem was also rejected (once again, the problem can be reproduced only on a powerful server with a large number of cores).
That does sound like something that would be good to improve on, though
I haven't looked at the proposed patch or read the associated thread, so
I'm not sure I can really comment on its rejection specifically.
Thanks,
Stephen
On 09.07.2020 19:19, Nikolay Samokhvalov wrote:
Hi Konstantin, a silly question: do you consider the workload you have as well-optimized? Can it be optimized further? Reading this thread I have a strong feeling that a very basic set of regular optimization actions is missing here (or not explained): query analysis and optimization based on pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the state of the server in general, resource consumption, etc.

Do you have some monitoring that covers pg_stat_statements?

Before looking under the hood, I would use multiple pg_stat_statements snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to understand the workload and identify the heaviest queries -- first of all, in terms of total_time, calls, shared buffers reads/hits, temporary files generation. Which query groups are Top-N in each category, have you looked at it?

You mentioned some crazy numbers for the planning time, but why not analyze the picture holistically and see the overall numbers? For those queries that have increased planning time, what is their share of total_time in the overall picture, in %? (Unfortunately, we cannot see the Top-N by planning time in pg_stat_statements till PG13, but that doesn't mean we cannot have a good understanding of the overall picture today; it just requires more work.)

If workload analysis & optimization was done holistically already, or not possible due to some reason — pardon me. But if not and if your primary goal is to improve this particular setup ASAP, then the topic could be started in the -performance mailing list first, discussing the workload and its aspects, and only after it's done, raised in -hackers. No?
Certainly, both we and the customer have done workload analysis & optimization.
It is not a problem of particular queries, bad plans, resource exhaustion, ...
Unfortunately, there are many scenarios where Postgres demonstrates not a gradual degradation of performance with increasing workload, but a "snow avalanche", when negative feedback causes a very fast paralysis of the system.
This case is just one of these scenarios. It is hard to say for sure what triggers the avalanche... A long-living transaction, a huge number of tables, aggressive autovacuum settings... But there is a cascade of negative events which causes a system that normally functions for months to stop working at all.
In this particular case we have the following chain:
- a long-living transaction causes autovacuum to send a lot of invalidation messages
- these messages cause an overflow of the invalidation message queues, forcing backends to invalidate their caches and reload from the catalog
- the too-small fast-path lock cache causes many concurrent accesses to the shared lock hash
- contention for the LWLocks, caused by the small number of lock partitions, causes starvation
Great idea.
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction and when there are no IN_PROGRESS transactions running at the RR or S level (which is a very common case, because RC is the default level and this is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it has always puzzled me that autovacuum has a global XID "horizon".
With such an optimization, the "hot_standby_feedback=on" mode could also be implemented more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for the RC and RR/S levels.
Without this, we cannot have good performance for HTAP use cases in Postgres – the presence of just a small number of long-running transactions is, indeed, known to kill the performance of OLTP workloads quickly, and it leads to much faster bloat growth than necessary.
However, maybe I'm wrong in these considerations, or it's impossible / too
difficult to implement.
Greetings,
We generally prefer that you don't top-post on these lists.
* Nikolay Samokhvalov (samokhvalov@gmail.com) wrote:
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction and when there are no IN_PROGRESS transactions running at the RR or S level (which is a very common case, because RC is the default level and this is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it has always puzzled me that autovacuum has a global XID "horizon".
Yeah, I've had thoughts along the same lines, though I had some ideas
that we could actually manage to support it even with RR (at least...
not sure about serializable) by considering what tuples the transactions
in the system could actually see (eg: even with RR, a tuple created
after that transaction started and was then deleted wouldn't ever be
able to be seen and therefore could be cleaned up..). A further thought
on that was to only spend that kind of effort once a tuple had aged a
certain amount, though it depends a great deal on exactly what would
need to be done for this.
Unfortunately, anything in this area is likely to carry a good bit of
risk associated with it as VACUUM doing the wrong thing would be quite
bad.
Thanks,
Stephen
On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.
To avoid division, we can specify the log2 of these values, so a shift can be used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
Because I was involved in this particular case and don't want it to become a habit, I'm volunteering to test whatever patch this discussion produces.
--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jul 9, 2020 at 10:00 PM Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction and when there are no IN_PROGRESS transactions running at the RR or S level (which is a very common case, because RC is the default level and this is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it has always puzzled me that autovacuum has a global XID "horizon".
With such an optimization, the "hot_standby_feedback=on" mode could also be implemented more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for the RC and RR/S levels.
Yes, the current way of calculating dead tuples is lossy, because we rely only on the oldest xid. However, if we kept the oldest snapshot instead of the oldest xmin, long-running transactions wouldn't be such a disaster. I don't think this is feasible with the current snapshot model, because keeping full snapshots instead of just xmins would bloat shared-memory structs and complicate computations. But CSN can certainly support this optimization.
------
Regards,
Alexander Korotkov
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Unfortunately, we do not have to wait a decade or two.
Postgres already faces multiple problems on existing multiprocessor systems (64, 96, ... cores).
And it is not even necessary to initiate thousands of connections: it is enough to load all these cores and let them compete for some resource (an LWLock, a buffer, ...). Even standard pgbench/YCSB benchmarks with a zipfian distribution can illustrate these problems.
I concur with you. VMs and bare metal machines with 100~200 CPU cores and TBs of RAM are already available even on public clouds. The users easily set max_connections to a high value like 10,000, create thousands or tens of thousands of relations, and expect it to go smoothly. Although it may be a horror for PG developers who know the internals well, Postgres has grown into a great database to be relied upon.
Besides, I don't want people to think like "Postgres cannot scale up on one machine, so we need scale-out." I understand some form of scale-out is necessary for large-scale analytics and web-scale multitenant OLTP, but it would be desirable to be able to cover the OLTP workloads for one organization/region with the advances in hardware and Postgres leveraging those advances, without something like Oracle RAC.
There were many proposed patches which help to improve this situation.
But since these patches increase performance only on huge servers with a large number of cores and show almost no improvement (or even some degradation) on standard 4-core desktops, almost none of them were committed.
Consequently, our customers have a lot of trouble trying to replace Oracle with Postgres and get the same performance on the same (quite good and expensive) hardware.
Yeah, it's a pity that the shiny-looking patches from Postgres Pro (mostly from Konstantin san?) -- autoprepare, built-in connection pooling, fair lwlock, and revolutionary multi-threaded backend -- haven't gained much attention.
Regards
Takayuki Tsunakawa
On 09.07.2020 22:16, Grigory Smolkin wrote:
On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.
To avoid division, we can specify the log2 of these values, so a shift can be used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.

Because I was involved in this particular case and don't want it to become a habit, I'm volunteering to test whatever patch this discussion produces.
You are welcome:)
Attachment: skip_autovacuum.patch (text/x-patch)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 68effcaed6..27f6328dca 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -607,7 +607,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ OldestXmin);
pgstat_progress_end_command();
/* and log the action if appropriate */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60..0eed10fd65 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -92,6 +92,7 @@
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
+#include "storage/procarray.h"
#include "storage/procsignal.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
@@ -2992,6 +2993,9 @@ relation_needs_vacanalyze(Oid relid,
TransactionId xidForceLimit;
MultiXactId multiForceLimit;
+ TransactionId oldestXmin;
+ Relation rel;
+
AssertArg(classForm != NULL);
AssertArg(OidIsValid(relid));
@@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
instuples = tabentry->inserts_since_vacuum;
anltuples = tabentry->changes_since_analyze;
+ rel = RelationIdGetRelation(relid);
+ oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ RelationClose(rel);
+
vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
@@ -3095,9 +3103,20 @@ relation_needs_vacanalyze(Oid relid,
vactuples, vacthresh, anltuples, anlthresh);
/* Determine if this table needs vacuum or analyze. */
- *dovacuum = force_vacuum || (vactuples > vacthresh) ||
- (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
- *doanalyze = (anltuples > anlthresh);
+ if (tabentry->autovac_oldest_xmin != oldestXmin)
+ {
+ *dovacuum = force_vacuum || (vactuples > vacthresh) ||
+ (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
+ *doanalyze = (anltuples > anlthresh);
+ elog(DEBUG1, "Consider relation %d tabentry=%p, tabentry->autovac_oldest_xmin=%d, oldestXmin=%d, dovacuum=%d, doanalyze=%d",
+ relid, tabentry, tabentry->autovac_oldest_xmin, oldestXmin, *dovacuum, *doanalyze);
+ }
+ else
+ {
+ elog(DEBUG1, "Skip autovacuum of relation %d", relid);
+ *dovacuum = force_vacuum;
+ *doanalyze = false;
+ }
}
else
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..99db7884de 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1461,7 +1461,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ TransactionId oldestXmin)
{
PgStat_MsgVacuum msg;
@@ -1475,6 +1476,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_oldest_xmin = oldestXmin;
pgstat_send(&msg, sizeof(msg));
}
@@ -4838,6 +4840,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->analyze_count = 0;
result->autovac_analyze_timestamp = 0;
result->autovac_analyze_count = 0;
+ result->autovac_oldest_xmin = InvalidTransactionId;
}
return result;
@@ -6007,6 +6010,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->analyze_count = 0;
tabentry->autovac_analyze_timestamp = 0;
tabentry->autovac_analyze_count = 0;
+ tabentry->autovac_oldest_xmin = InvalidTransactionId;
}
else
{
@@ -6288,6 +6292,8 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->autovac_oldest_xmin = msg->m_oldest_xmin;
+
/*
* It is quite possible that a non-aggressive VACUUM ended up skipping
* various pages, however, we'll zero the insert counter here regardless.
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..ef92c5632f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -383,6 +383,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ TransactionId m_oldest_xmin;
} PgStat_MsgVacuum;
@@ -692,6 +693,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter analyze_count;
TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
PgStat_Counter autovac_analyze_count;
+
+ TransactionId autovac_oldest_xmin;
} PgStat_StatTabEntry;
@@ -1301,7 +1304,8 @@ extern void pgstat_reset_slru_counter(const char *);
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ TransactionId oldestXmin);
extern void pgstat_report_analyze(Relation rel,
PgStat_Counter livetuples, PgStat_Counter deadtuples,
bool resetcounter);
On Thu, 2020-07-09 at 12:47 -0400, Stephen Frost wrote:
I realize this is likely to go over like a lead balloon, but the churn
in pg_class from updating reltuples/relpages has never seemed all that
great to me when just about everything else is so rarely changed, and
only through some user DDL action- and I agree that it seems like those
particular columns are more 'statistics' type of info and less info
about the definition of the relation. Other columns that do get changed
regularly are relfrozenxid and relminmxid. I wonder if it's possible to
move all of those elsewhere- perhaps some to the statistics tables as
you seem to be alluding to, and the others to $somewhereelse that is
dedicated to tracking that information which VACUUM is primarily
concerned with.
Perhaps we could create pg_class with a fillfactor less than 100
so we get HOT updates there.
That would be less invasive.
Yours,
Laurenz Albe
On Fri, Jul 10, 2020 at 02:10:20AM +0000, tsunakawa.takay@fujitsu.com
wrote:
There were many proposed patches which help to improve this
situation. But since these patches increase performance only
on huge servers with a large number of cores and show almost no
improvement (or even some degradation) on standard 4-core desktops,
almost none of them were committed. Consequently our customers have
a lot of trouble trying to replace Oracle with Postgres and provide
the same performance on the same (quite good and expensive) hardware.
Yeah, it's a pity that the shiny-looking patches from Postgres Pro
(mostly from Konstantin-san?) -- autoprepare, built-in connection
pooling, fair lwlock, and revolutionary multi-threaded backend --
haven't gained hot attention.
Yeah, it is probably time for us to get access to a current large-scale
machine again and really find the bottlenecks. We seem to need this
every few years.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
On 2020-Jul-10, Konstantin Knizhnik wrote:
@@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
instuples = tabentry->inserts_since_vacuum;
anltuples = tabentry->changes_since_analyze;
+ rel = RelationIdGetRelation(relid);
+ oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ RelationClose(rel);
*cough*
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 15.07.2020 02:17, Alvaro Herrera wrote:
On 2020-Jul-10, Konstantin Knizhnik wrote:
@@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
instuples = tabentry->inserts_since_vacuum;
anltuples = tabentry->changes_since_analyze;
+ rel = RelationIdGetRelation(relid);
+ oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ RelationClose(rel);
*cough*
Sorry, Alvaro.
Can you explain this *cough*?
You didn't like that the relation is opened just to call GetOldestXmin?
But this function requires a Relation. Do you suggest rewriting it so
that it is possible to pass just the Oid of the relation?
Or do you think that such a calculation of oldestXmin is obscure and at
least requires some comment?
Actually, I copied it from vacuum.c, where there is a large comment
explaining why it is calculated this way.
Maybe it is enough to add a reference to vacuum.c?
Or maybe create some special function for it?
I just need oldestXmin to be calculated in the same way in vacuum.c and
autovacuum.c.
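For illustration only, such a shared helper might look roughly like the
sketch below. The function name and the idea of exporting it from
vacuum.c are assumptions; the two calls are exactly the ones the patch
copies from vacuum.c (see the comment in vacuum_set_xid_limits()):

/*
 * Hypothetical helper, e.g. exported from vacuum.c: compute the vacuum
 * horizon for a relation the same way vacuum_set_xid_limits() does, so
 * that vacuum.c and autovacuum.c cannot drift apart.
 * Needs storage/procarray.h and utils/snapmgr.h.
 */
TransactionId
vacuum_get_oldest_xmin(Relation rel)
{
	TransactionId xmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);

	return TransactionIdLimitedForOldSnapshots(xmin, rel);
}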
On 2020-Jul-15, Konstantin Knizhnik wrote:
On 15.07.2020 02:17, Alvaro Herrera wrote:
On 2020-Jul-10, Konstantin Knizhnik wrote:
@@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
instuples = tabentry->inserts_since_vacuum;
anltuples = tabentry->changes_since_analyze;
+ rel = RelationIdGetRelation(relid);
+ oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ RelationClose(rel);
*cough*
Sorry, Alvaro.
Can you explain this *cough*?
You didn't like that the relation is opened just to call GetOldestXmin?
But this function requires a Relation. Do you suggest rewriting it so that
it is possible to pass just the Oid of the relation?
At that point of autovacuum, you don't have a lock on the relation; the
only thing you have is a pg_class tuple (and we do it that way on
purpose as I recall). I think asking relcache for it is dangerous, and
moreover requesting the relcache for it directly runs counter to our normal
coding pattern. At the very least you should have a comment explaining
why you do it and why it's okay to do it, and also handle the case when
RelationIdGetRelation returns null.
However, looking at the bigger picture I wonder if it would be better to
test the getoldestxmin much later in the process to avoid this whole
issue. Just carry forward the relation until the point where vacuum is
called ... that may be cleaner? And the extra cost is not that much.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
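To make Alvaro's suggestion concrete, here is a minimal sketch of a more
careful way to open the relation. It is roughly the pattern the v2 patch
further down adopts; the wrapper name is made up, and the explicit unlock
in the dropped-relation case is an addition, not something the posted
patch does:

/*
 * Sketch only: compute oldestXmin for a relation we do not yet hold a
 * lock on, without blocking autovacuum and without assuming the
 * relation still exists.
 */
static TransactionId
autovac_oldest_xmin_for(Oid relid)
{
	TransactionId oldestXmin = InvalidTransactionId;

	if (ConditionalLockRelationOid(relid, AccessShareLock))
	{
		Relation rel = try_relation_open(relid, NoLock);

		if (rel != NULL)
		{
			oldestXmin = TransactionIdLimitedForOldSnapshots(
				GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
			relation_close(rel, AccessShareLock);	/* releases our lock too */
		}
		else
			UnlockRelationOid(relid, AccessShareLock);	/* relation was dropped */
	}

	return oldestXmin;
}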
On 15.07.2020 18:03, Alvaro Herrera wrote:
On 2020-Jul-15, Konstantin Knizhnik wrote:
On 15.07.2020 02:17, Alvaro Herrera wrote:
On 2020-Jul-10, Konstantin Knizhnik wrote:
@@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
instuples = tabentry->inserts_since_vacuum;
anltuples = tabentry->changes_since_analyze;
+ rel = RelationIdGetRelation(relid);
+ oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ RelationClose(rel);
*cough*
Sorry, Alvaro.
Can you explain this *cough*?
You didn't like that the relation is opened just to call GetOldestXmin?
But this function requires a Relation. Do you suggest rewriting it so that
it is possible to pass just the Oid of the relation?
At that point of autovacuum, you don't have a lock on the relation; the
only thing you have is a pg_class tuple (and we do it that way on
purpose as I recall). I think asking relcache for it is dangerous, and
moreover requesting the relcache for it directly runs counter to our normal
coding pattern. At the very least you should have a comment explaining
why you do it and why it's okay to do it, and also handle the case when
RelationIdGetRelation returns null.
However, looking at the bigger picture I wonder if it would be better to
test the getoldestxmin much later in the process to avoid this whole
issue. Just carry forward the relation until the point where vacuum is
called ... that may be cleaner? And the extra cost is not that much.
Thank you for the explanation.
I have prepared a new version of the patch, which opens the relation with care.
Concerning your suggestion to perform this check later (in vacuum_rel(),
I guess?): I tried to implement it but ran into another problem.
Right now, information about autovacuum_oldest_xmin for a relation is
stored in the statistics (PgStat_StatTabEntry),
together with other autovacuum-related fields like
autovac_vacuum_timestamp, autovac_analyze_timestamp, ...
I am not sure that this is the right place for storing autovacuum_oldest_xmin,
but I didn't want to set up yet another hash table in shared memory
just for keeping this field. Maybe you can suggest something better...
But right now it is stored there when vacuum completes.
PgStat_StatTabEntry is obtained by get_pgstat_tabentry_relid(), which
also needs pgstat_fetch_stat_dbentry(MyDatabaseId) and
pgstat_fetch_stat_dbentry(InvalidOid). I do not want to copy all this
stuff into vacuum.c.
It seems to me easier to open the relation in
relation_needs_vacanalyze(), doesn't it?
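For reference, the lookup chain mentioned above looks roughly like the
fragment below (existing autovacuum.c/pgstat.c routines, shown only to
illustrate why moving the check into vacuum.c would drag this pgstat
machinery along; relid and classForm come from the caller, and
get_pgstat_tabentry_relid() is a static helper in autovacuum.c):

/* In the do_autovacuum() / relation_needs_vacanalyze() context: */
PgStat_StatDBEntry *shared  = pgstat_fetch_stat_dbentry(InvalidOid);
PgStat_StatDBEntry *dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
PgStat_StatTabEntry *tabentry =
	get_pgstat_tabentry_relid(relid, classForm->relisshared, shared, dbentry);

/* With the patch, tabentry->autovac_oldest_xmin remembers the horizon
 * used by the last (auto)vacuum of this relation. */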
Attachment: skip_autovacuum-2.patch (text/x-patch)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f7..f7e17a9bbf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -607,7 +607,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ OldestXmin);
pgstat_progress_end_command();
/* and log the action if appropriate */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60..00f7cb4ca1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -92,6 +92,7 @@
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
+#include "storage/procarray.h"
#include "storage/procsignal.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
@@ -2992,6 +2993,9 @@ relation_needs_vacanalyze(Oid relid,
TransactionId xidForceLimit;
MultiXactId multiForceLimit;
+ TransactionId oldestXmin = InvalidTransactionId;
+ Relation rel;
+
AssertArg(classForm != NULL);
AssertArg(OidIsValid(relid));
@@ -3076,6 +3080,20 @@ relation_needs_vacanalyze(Oid relid,
instuples = tabentry->inserts_since_vacuum;
anltuples = tabentry->changes_since_analyze;
+ /*
+ * We need to calculate oldestXmin for relation in the same way as it is done in vacuum.c
+ * (see comment in vacuum_set_xid_limits).
+ * This is why we have to open relation.
+ */
+ if (ConditionalLockRelationOid(relid, AccessShareLock))
+ {
+ rel = try_relation_open(relid, NoLock);
+ if (rel)
+ {
+ oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ relation_close(rel, AccessShareLock);
+ }
+ }
vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
@@ -3094,10 +3112,24 @@ relation_needs_vacanalyze(Oid relid,
NameStr(classForm->relname),
vactuples, vacthresh, anltuples, anlthresh);
- /* Determine if this table needs vacuum or analyze. */
- *dovacuum = force_vacuum || (vactuples > vacthresh) ||
- (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
- *doanalyze = (anltuples > anlthresh);
+ /*
+ * Determine if this table needs vacuum or analyze.
+ * Do not perform autovacuum if we have already done it with the same oldestXmin.
+ */
+ if (!TransactionIdIsValid(oldestXmin) || tabentry->autovac_oldest_xmin != oldestXmin)
+ {
+ *dovacuum = force_vacuum || (vactuples > vacthresh) ||
+ (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
+ *doanalyze = (anltuples > anlthresh);
+ elog(DEBUG1, "Consider relation %d tabentry=%p, tabentry->autovac_oldest_xmin=%d, oldestXmin=%d, dovacuum=%d, doanalyze=%d",
+ relid, tabentry, tabentry->autovac_oldest_xmin, oldestXmin, *dovacuum, *doanalyze);
+ }
+ else
+ {
+ elog(DEBUG1, "Skip autovacuum of relation %d", relid);
+ *dovacuum = force_vacuum;
+ *doanalyze = false;
+ }
}
else
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2da2..67504d1d99 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1461,7 +1461,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ TransactionId oldestXmin)
{
PgStat_MsgVacuum msg;
@@ -1475,6 +1476,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_oldest_xmin = oldestXmin;
pgstat_send(&msg, sizeof(msg));
}
@@ -4838,6 +4840,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->analyze_count = 0;
result->autovac_analyze_timestamp = 0;
result->autovac_analyze_count = 0;
+ result->autovac_oldest_xmin = InvalidTransactionId;
}
return result;
@@ -6007,6 +6010,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
tabentry->analyze_count = 0;
tabentry->autovac_analyze_timestamp = 0;
tabentry->autovac_analyze_count = 0;
+ tabentry->autovac_oldest_xmin = InvalidTransactionId;
}
else
{
@@ -6288,6 +6292,8 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->autovac_oldest_xmin = msg->m_oldest_xmin;
+
/*
* It is quite possible that a non-aggressive VACUUM ended up skipping
* various pages, however, we'll zero the insert counter here regardless.
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..ef92c5632f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -383,6 +383,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ TransactionId m_oldest_xmin;
} PgStat_MsgVacuum;
@@ -692,6 +693,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter analyze_count;
TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
PgStat_Counter autovac_analyze_count;
+
+ TransactionId autovac_oldest_xmin;
} PgStat_StatTabEntry;
@@ -1301,7 +1304,8 @@ extern void pgstat_reset_slru_counter(const char *);
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ TransactionId oldestXmin);
extern void pgstat_report_analyze(Relation rel,
PgStat_Counter livetuples, PgStat_Counter deadtuples,
bool resetcounter);