Protect syscache from bloating with negative cache entries
Hello, recently one of my customers stumbled over immoderate
catcache bloat.
This is a known issue living on the Todo page in the PostgreSQL
wiki.
https://wiki.postgresql.org/wiki/Todo#Cache_Usage
Fix memory leak caused by negative catcache entries
/messages/by-id/51C0A1FF.2050404@vmware.com
This patch addresses two cases of syscache bloat using the
invalidation callback mechanism.
Overview of the patch
The bloat is caused by negative cache entries in catcaches. They
are crucial for performance, but the problem is that there's no
way to remove them: they last for the backend's lifetime.
The first patch provides a means to flush catcache negative
entries, then defines a relcache invalidation callback to flush
negative entries in the syscaches for pg_statistic (STATRELATTINH)
and pg_attribute (ATTNAME, ATTNUM). The second patch implements a
syscache invalidation callback so that deletion of a schema
causes a flush for pg_class (RELNAMENSP).
Neither of the above is hard-coded; both are defined in cacheinfo
using four additional members.
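To illustrate the mechanism, here is a minimal sketch, not the patch
itself: the callback names and bodies are invented, while
CacheRegisterRelcacheCallback(), CacheRegisterSyscacheCallback() and
NAMESPACEOID are the real interface.

#include "postgres.h"
#include "utils/inval.h"
#include "utils/syscache.h"

/* invented: run whenever the relcache entry for 'relid' is invalidated */
static void
flush_neg_entries_relcallback(Datum arg, Oid relid)
{
    /* would drop negative ATTNAME/ATTNUM/STATRELATTINH entries for relid */
}

/* invented: run whenever a pg_namespace syscache entry is invalidated */
static void
flush_neg_entries_syscallback(Datum arg, int cacheid, uint32 hashvalue)
{
    /* would drop negative RELNAMENSP entries under the dropped schema */
}

static void
register_flush_callbacks(void)
{
    CacheRegisterRelcacheCallback(flush_neg_entries_relcallback, (Datum) 0);
    CacheRegisterSyscacheCallback(NAMESPACEOID,
                                  flush_neg_entries_syscallback, (Datum) 0);
}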
Remaining problems
Still, the catcache can bloat if one repeatedly accesses
non-existent tables with unique names in a long-lived schema, but
that seems a bit too artificial (or malicious). Since such negative
entries have no trigger to remove them, caps would be needed to
prevent them from bloating the syscaches, but reasonable limits
seem hard to determine.
Defects or disadvantages
This patch scans the whole target catcache to find negative
entries to remove, which might take a (comparatively) long time on
a catcache with many entries. With the second patch, unrelated
negative entries may be swept up in a flush, since they are keyed
by hash value, not by the exact key values.
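As a rough illustration of the cost, the scan is essentially the
following shape (simplified against the catcache.c structures;
matches_relid() is an invented helper, not the patch's actual code):

static void
CleanupNegativeEntries(CatCache *cache, Oid relid)
{
    int         i;

    /* there is no index on the partial key, so every bucket is visited */
    for (i = 0; i < cache->cc_nbuckets; i++)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &cache->cc_bucket[i])
        {
            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

            /*
             * Nobody can hold a reference to a negative entry between
             * lookups, so an unreferenced one can simply be removed.
             */
            if (ct->negative && ct->refcount == 0 &&
                matches_relid(cache, ct, relid))    /* invented helper */
                CatCacheRemoveCTup(cache, ct);
        }
    }
}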
The attached files are as follows.
1. 0001-Cleanup-negative-cache-of-pg_statistic-when-dropping.patch
Negative entry flushing by relcache invalidation using
relcache invalidation callback.
2. 0002-Cleanup-negative-cache-of-pg_class-when-dropping-a-s.patch
Negative entry flushing by catcache invalidation using
catcache invalidation callback.
3. gen.pl
a test script for STATRELATTINH bloating.
4. gen2.pl
a test script for RELNAMENSP bloating.
3 and 4 are used as follows:
./gen.pl | psql postgres > /dev/null
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Mon, Dec 19, 2016 at 6:15 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, recently one of my customers stumbled over immoderate
catcache bloat.
This isn't only an issue for negative catcache entries. A long time
ago, there was a limit on the size of the relcache, which was removed
because if you have a workload where the working set of relations is
just larger than the limit, performance is terrible. But the problem
now is that backend memory usage can grow without bound, and that's
also bad, especially on systems with hundreds of long-lived backends.
In connection-pooling environments, the problem is worse, because
every connection in the pool eventually caches references to
everything of interest to any client.
Your patches seem to me to have some merit, but I wonder if we should
also consider having a time-based threshold of some kind. If, say, a
backend hasn't accessed a catcache or relcache entry for many minutes,
it becomes eligible to be flushed. We could implement this by having
some process, like the background writer,
SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
every 10 minutes or so. When a process receives this signal, it sets
a flag that is checked before going idle. When it sees the flag set,
it makes a pass over every catcache and relcache entry. All the ones
that are unmarked get marked, and all of the ones that are marked get
removed. Access to an entry clears any mark. So anything that's not
touched for more than 10 minutes starts dropping out of backend
caches.
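A minimal sketch of that sweep, assuming a 'marked' flag were added to
CatCTup (neither the flag nor PROCSIG_HOUSEKEEPING exists today; this
is purely illustrative):

static void
HousekeepingSweep(CatCache *cache)
{
    int         i;

    for (i = 0; i < cache->cc_nbuckets; i++)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &cache->cc_bucket[i])
        {
            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

            if (ct->marked)
            {
                /* not touched since the last pass: drop it */
                if (ct->refcount == 0)
                    CatCacheRemoveCTup(cache, ct);
            }
            else
                ct->marked = true;  /* removal candidate for the next pass */
        }
    }
}

/* ...and every hit in SearchCatCache() would clear it: ct->marked = false */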
Anyway, that would be a much bigger change from what you are proposing
here, and what you are proposing here seems reasonable so I guess I
shouldn't distract from it. Your email just made me think of it,
because I agree that catcache/relcache bloat is a serious issue.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:
We could implement this by having
some process, like the background writer,
SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
every 10 minutes or so.
... on a rolling basis.
Otherwise that'll be no fun at all, especially with some of those
lovely "we kept getting errors so we raised max_connections to 5000"
systems out there. But also on more sensibly configured ones that're
busy and want nice smooth performance without stalls.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:
We could implement this by having
some process, like the background writer,
SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
every 10 minutes or so.
... on a rolling basis.
I don't understand why we'd make that a system-wide behavior at all,
rather than expecting each process to manage its own cache.
regards, tom lane
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:
We could implement this by having
some process, like the background writer,
SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
every 10 minutes or so.
... on a rolling basis.
I don't understand why we'd make that a system-wide behavior at all,
rather than expecting each process to manage its own cache.
Individual backends don't have a really great way to do time-based
stuff, do they? I mean, yes, there is enable_timeout() and friends,
but I think that requires quite a bit of bookkeeping.
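For what it's worth, the raw API exists; here is a sketch of the
minimal bookkeeping. The handler, flag, and interval are illustrative,
while RegisterTimeout() and enable_timeout_after() are the real
timeout.c interface; the timeout would also need re-arming after each
firing, which is part of the bookkeeping in question.

#include "utils/timeout.h"

static volatile sig_atomic_t prune_caches_pending = false;
static TimeoutId cache_prune_timeout;

static void
cache_prune_timeout_handler(void)
{
    /* runs in a signal handler, so just set a flag checked before idling */
    prune_caches_pending = true;
}

static void
setup_cache_prune_timer(void)
{
    cache_prune_timeout = RegisterTimeout(USER_TIMEOUT,
                                          cache_prune_timeout_handler);
    enable_timeout_after(cache_prune_timeout, 10 * 60 * 1000);  /* 10 min */
}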
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't understand why we'd make that a system-wide behavior at all,
rather than expecting each process to manage its own cache.
Individual backends don't have a really great way to do time-based
stuff, do they? I mean, yes, there is enable_timeout() and friends,
but I think that requires quite a bit of bookkeeping.
If I thought that "every ten minutes" was an ideal way to manage this,
I might worry about that, but it doesn't really sound promising at all.
Every so many queries would likely work better, or better yet make it
self-adaptive depending on how much is in the local syscache.
The bigger picture here though is that we used to have limits on syscache
size, and we got rid of them (commit 8b9bc234a, see also
/messages/by-id/5141.1150327541@sss.pgh.pa.us)
not only because of the problem you mentioned about performance falling
off a cliff once the working-set size exceeded the arbitrary limit, but
also because enforcing the limit added significant overhead --- and did so
whether or not you got any benefit from it, ie even if the limit is never
reached. Maybe the present patch avoids imposing a pile of overhead in
situations where no pruning is needed, but it doesn't really look very
promising from that angle in a quick once-over.
BTW, I don't see the point of the second patch at all? Surely, if
an object is deleted or updated, we already have code that flushes
related catcache entries. Otherwise the caches would deliver wrong
data.
regards, tom lane
On Tue, Dec 20, 2016 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't understand why we'd make that a system-wide behavior at all,
rather than expecting each process to manage its own cache.
Individual backends don't have a really great way to do time-based
stuff, do they? I mean, yes, there is enable_timeout() and friends,
but I think that requires quite a bit of bookkeeping.
If I thought that "every ten minutes" was an ideal way to manage this,
I might worry about that, but it doesn't really sound promising at all.
Every so many queries would likely work better, or better yet make it
self-adaptive depending on how much is in the local syscache.
I don't think "every so many queries" is very promising at all.
First, it has the same problem as a fixed cap on the number of
entries: if you're doing a round-robin just slightly bigger than that
value, performance will be poor. Second, what's really important here
is to keep the percentage of wall-clock time spent populating the
system caches small. If a backend is doing 4000 queries/second and
each of those 4000 queries touches a different table, it really needs
a cache of at least 4000 entries or it will thrash and slow way down.
But if it's doing a query every 10 minutes and those queries
round-robin between 4000 different tables, it doesn't really need a
4000-entry cache. If those queries are long-running, the time to
repopulate the cache will only be a tiny fraction of runtime. If the
queries are short-running, then the effect is, percentage-wise, just
the same as for the high-volume system, but in practice it isn't
likely to be felt as much. I mean, if we keep a bunch of old cache
entries around on a mostly-idle backend, they are going to be pushed
out of CPU caches and maybe even paged out. One can't expect a
backend that is woken up after a long sleep to be quite as snappy as
one that's continuously active.
Which gets to my third point: anything that's based on number of
queries won't do anything to help the case where backends sometimes go
idle and sit there for long periods. Reducing resource utilization in
that case would be beneficial. Ideally I'd like to get rid of not
only the backend-local cache contents but the backend itself, but
that's a much harder project.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for the discussion.
At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23492.1482264621@sss.pgh.pa.us>
The bigger picture here though is that we used to have limits on syscache
size, and we got rid of them (commit 8b9bc234a, see also
/messages/by-id/5141.1150327541@sss.pgh.pa.us)
not only because of the problem you mentioned about performance falling
off a cliff once the working-set size exceeded the arbitrary limit, but
also because enforcing the limit added significant overhead --- and did so
whether or not you got any benefit from it, ie even if the limit is never
reached. Maybe the present patch avoids imposing a pile of overhead in
situations where no pruning is needed, but it doesn't really look very
promising from that angle in a quick once-over.
Indeed. As mentioned in the mail at the beginning of this thread, it
triggers a whole-cache scan if even one negative entry exists, even
when that entry is unrelated to the target relid, and that can take
significantly long on a fat cache.
Lists of negative entries, like CatCacheList, would help, but that
needs additional memory.
BTW, I don't see the point of the second patch at all? Surely, if
an object is deleted or updated, we already have code that flushes
related catcache entries. Otherwise the caches would deliver wrong
data.
Maybe you have misread the patch. Negative entries aren't flushed
by any existing means. Deletion of a namespace causes cascaded
object deletion according to the dependencies, which finally leads
to invalidation of the *non-negative* cache entries. But removal of
*negative entries* in RELNAMENSP never happens.
The test script for the case (gen2.pl) does the following:
CREATE SCHEMA foo;
SELECT * FROM foo.invalid;
DROP SCHEMA foo;
Removing the schema foo leaves a negative cache entry for
'foo.invalid' in RELNAMENSP.
However, I'm not sure the above situation happens frequently
enough to be worth amending.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 21 Dec 2016 10:21:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161221.102109.51106943.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23492.1482264621@sss.pgh.pa.us>
The bigger picture here though is that we used to have limits on syscache
size, and we got rid of them (commit 8b9bc234a, see also
/messages/by-id/5141.1150327541@sss.pgh.pa.us)
not only because of the problem you mentioned about performance falling
off a cliff once the working-set size exceeded the arbitrary limit, but
also because enforcing the limit added significant overhead --- and did so
whether or not you got any benefit from it, ie even if the limit is never
reached. Maybe the present patch avoids imposing a pile of overhead in
situations where no pruning is needed, but it doesn't really look very
promising from that angle in a quick once-over.
Indeed. As mentioned in the mail at the beginning of this thread, it
triggers a whole-cache scan if even one negative entry exists, even
when that entry is unrelated to the target relid, and that can take
significantly long on a fat cache.
Lists of negative entries, like CatCacheList, would help, but that
needs additional memory.
BTW, I don't see the point of the second patch at all? Surely, if
an object is deleted or updated, we already have code that flushes
related catcache entries. Otherwise the caches would deliver wrong
data.
Maybe you have misread the patch. Negative entries aren't flushed
by any existing means. Deletion of a namespace causes cascaded
object deletion according to the dependencies, which finally leads
to invalidation of the *non-negative* cache entries. But removal of
*negative entries* in RELNAMENSP never happens.
The test script for the case (gen2.pl) does the following:
CREATE SCHEMA foo;
SELECT * FROM foo.invalid;
DROP SCHEMA foo;
Removing the schema foo leaves a negative cache entry for
'foo.invalid' in RELNAMENSP.
However, I'm not sure the above situation happens frequently
enough to be worth amending.
Since 1753b1b conflicts with this patch, I rebased it onto the
current master HEAD. I'll register this in the next CF.
The points of discussion are the following, I think.
1. The first patch seems to work well. It costs the time to scan
the whole of a catcache that has negative entries for other
reloids. However, such negative entries are created only by rather
unusual usage: accessing undefined columns, or accessing
columns on which no statistics have been created. The
whole-catcache scan occurs on ATTNAME, ATTNUM and
STATRELATTINH for every invalidation of a relcache entry.
2. The second patch also works, but flushing negative entries by
hash value is inefficient. It scans the bucket corresponding
to the given hash value for OIDs, then flushes negative entries
by iterating over all the collected OIDs. So this costs more time
than 1 and flushes entries that don't actually need to be
removed. If this feature is valuable but such side effects
are not acceptable, a new invalidation category based on a
cacheid-oid pair would be needed.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Wed, Dec 21, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If I thought that "every ten minutes" was an ideal way to manage this,
I might worry about that, but it doesn't really sound promising at all.
Every so many queries would likely work better, or better yet make it
self-adaptive depending on how much is in the local syscache.
The bigger picture here though is that we used to have limits on syscache
size, and we got rid of them (commit 8b9bc234a, see also
/messages/by-id/5141.1150327541@sss.pgh.pa.us)
not only because of the problem you mentioned about performance falling
off a cliff once the working-set size exceeded the arbitrary limit, but
also because enforcing the limit added significant overhead --- and did so
whether or not you got any benefit from it, ie even if the limit is never
reached. Maybe the present patch avoids imposing a pile of overhead in
situations where no pruning is needed, but it doesn't really look very
promising from that angle in a quick once-over.
Have there ever been discussions about having catcache entries in a
shared memory area? This does not sound great performance-wise; I am
just wondering about the concept, and I cannot find references to such
discussions.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
Have there ever been discussions about having catcache entries in a
shared memory area? This does not sound great performance-wise; I am
just wondering about the concept, and I cannot find references to such
discussions.
I'm sure it's been discussed. Offhand I remember the following issues:
* A shared cache would create locking and contention overhead.
* A shared cache would have a very hard size limit, at least if it's
in SysV-style shared memory (perhaps DSM would let us relax that).
* Transactions that are doing DDL have a requirement for the catcache
to reflect changes that they've made locally but not yet committed,
so said changes mustn't be visible globally.
You could possibly get around the third point with a local catcache that's
searched before the shared one, but tuning that to be performant sounds
like a mess. Also, I'm not sure how such a structure could cope with
uncommitted deletions: delete A -> remove A from local catcache, but not
the shared one -> search for A in local catcache -> not found -> search
for A in shared catcache -> found -> oops.
regards, tom lane
On Fri, Jan 13, 2017 at 8:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
Have there ever been discussions about having catcache entries in a
shared memory area? This does not sound great performance-wise; I am
just wondering about the concept, and I cannot find references to such
discussions.
I'm sure it's been discussed. Offhand I remember the following issues:
* A shared cache would create locking and contention overhead.
* A shared cache would have a very hard size limit, at least if it's
in SysV-style shared memory (perhaps DSM would let us relax that).
* Transactions that are doing DDL have a requirement for the catcache
to reflect changes that they've made locally but not yet committed,
so said changes mustn't be visible globally.
You could possibly get around the third point with a local catcache that's
searched before the shared one, but tuning that to be performant sounds
like a mess. Also, I'm not sure how such a structure could cope with
uncommitted deletions: delete A -> remove A from local catcache, but not
the shared one -> search for A in local catcache -> not found -> search
for A in shared catcache -> found -> oops.
I think the first of those concerns is the key one. If searching the
system catalogs costs $100 and searching the private catcache costs
$1, what's the cost of searching a hypothetical shared catcache? If
the answer is $80, it's not worth doing. If the answer is $5, it's
probably still not worth doing. If the answer is $1.25, then it's
probably worth investing some energy into trying to solve the other
problems you list. For some users, the memory cost of catcache and
syscache entries multiplied by N backends are a very serious problem,
so it would be nice to have some other options. But we do so many
syscache lookups that a shared cache won't be viable unless it's
almost as fast as a backend-private cache, or at least that's my
hunch.
I think it would be interesting for somebody to build a prototype here
that ignores all the problems but the first and uses some
straightforward, relatively unoptimized locking strategy for the first
problem. Then benchmark it. If the results show that the idea has
legs, then we can try to figure out what a real implementation would
look like.
(One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 14, 2017 at 12:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 13, 2017 at 8:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
Have there ever been discussions about having catcache entries in a
shared memory area? This does not sound great performance-wise; I am
just wondering about the concept, and I cannot find references to such
discussions.
I'm sure it's been discussed. Offhand I remember the following issues:
* A shared cache would create locking and contention overhead.
* A shared cache would have a very hard size limit, at least if it's
in SysV-style shared memory (perhaps DSM would let us relax that).
* Transactions that are doing DDL have a requirement for the catcache
to reflect changes that they've made locally but not yet committed,
so said changes mustn't be visible globally.
You could possibly get around the third point with a local catcache that's
searched before the shared one, but tuning that to be performant sounds
like a mess. Also, I'm not sure how such a structure could cope with
uncommitted deletions: delete A -> remove A from local catcache, but not
the shared one -> search for A in local catcache -> not found -> search
for A in shared catcache -> found -> oops.
I think the first of those concerns is the key one. If searching the
system catalogs costs $100 and searching the private catcache costs
$1, what's the cost of searching a hypothetical shared catcache? If
the answer is $80, it's not worth doing. If the answer is $5, it's
probably still not worth doing. If the answer is $1.25, then it's
probably worth investing some energy into trying to solve the other
problems you list. For some users, the memory cost of catcache and
syscache entries multiplied by N backends are a very serious problem,
so it would be nice to have some other options. But we do so many
syscache lookups that a shared cache won't be viable unless it's
almost as fast as a backend-private cache, or at least that's my
hunch.
Being able to switch from one mode to another would be interesting.
Applications using extensive DDL that requires changing the catcache
with an exclusive lock would clearly pay the lock contention cost, but
do you think that would really be the case with a shared lock? A bunch
of applications that I work with deploy Postgres once, then don't
change the schema except when an upgrade happens, so a shared cache
would be beneficial there. There are even some apps that do not use
pgbouncer but drop sessions after a timeout of inactivity to avoid
memory bloat caused by the problem of this thread. That won't solve
the problem of local catcache bloat, but some users issuing few DDLs
may be fine paying some extra concurrency cost if session handling
gets easier.
I think it would be interesting for somebody to build a prototype here
that ignores all the problems but the first and uses some
straightforward, relatively unoptimized locking strategy for the first
problem. Then benchmark it. If the results show that the idea has
legs, then we can try to figure out what a real implementation would
look like.
(One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)
Yeah, I'd bet on a couple of days of focus to sort that out.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
... There are even some apps that do not use pgbouncer but drop
sessions after a timeout of inactivity to avoid memory bloat caused
by the problem of this thread.
Yeah, a certain company I used to work for had to do that, though their
problem had more to do with bloat in plpgsql's compiled-functions cache
(and ensuing bloat in the plancache), I believe.
Still, I'm pretty suspicious of anything that will add overhead to
catcache lookups. If you think the performance of those is not absolutely
critical, turning off the caches via -DCLOBBER_CACHE_ALWAYS will soon
disabuse you of the error.
I'm inclined to think that a more profitable direction to look in is
finding a way to limit the cache size. I know we got rid of exactly that
years ago, but the problems with it were (a) the mechanism was itself
pretty expensive --- a global-to-all-caches LRU list IIRC, and (b) there
wasn't a way to tune the limit. Possibly somebody can think of some
cheaper, perhaps less precise way of aging out old entries. As for
(b), this is the sort of problem we made GUCs for.
But, again, the catcache isn't the only source of per-process bloat
and I'm not even sure it's the main one. A more holistic approach
might be called for.
regards, tom lane
Hi,
On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
But, again, the catcache isn't the only source of per-process bloat
and I'm not even sure it's the main one. A more holistic approach
might be called for.
It'd be helpful if we'd find a way to make it easy to get statistics
about the size of various caches in production systems. Right now that's
kinda hard, resulting in us having to make a lot of guesses...
Andres
On 01/14/2017 12:06 AM, Andres Freund wrote:
Hi,
On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
But, again, the catcache isn't the only source of per-process bloat
and I'm not even sure it's the main one. A more holistic approach
might be called for.
It'd be helpful if we'd find a way to make it easy to get statistics
about the size of various caches in production systems. Right now
that's kinda hard, resulting in us having to make a lot of
guesses...
What about a simple C extension that could inspect those caches?
Assuming it could be loaded into a single backend, that should be a
relatively acceptable way (compared to loading it into all backends
using shared_preload_libraries).
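Something along these lines could be a starting point (a sketch:
MemoryContextStats() and CacheMemoryContext are real, but this only
dumps aggregate numbers to the server's stderr; per-catcache entry
counts would need catcache.c internals that aren't currently exposed):

#include "postgres.h"
#include "fmgr.h"
#include "utils/memutils.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(dump_cache_memory);

/* SQL: CREATE FUNCTION dump_cache_memory() RETURNS void
 *      AS 'MODULE_PATHNAME' LANGUAGE C;                  */
Datum
dump_cache_memory(PG_FUNCTION_ARGS)
{
    /* writes a per-context breakdown of cache memory to stderr */
    MemoryContextStats(CacheMemoryContext);
    PG_RETURN_VOID();
}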
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 14, 2017 at 9:36 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 01/14/2017 12:06 AM, Andres Freund wrote:
On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
But, again, the catcache isn't the only source of per-process bloat
and I'm not even sure it's the main one. A more holistic approach
might be called for.
It'd be helpful if we'd find a way to make it easy to get statistics
about the size of various caches in production systems. Right now
that's kinda hard, resulting in us having to make a lot of
guesses...
What about a simple C extension that could inspect those caches?
Assuming it could be loaded into a single backend, that should be a
relatively acceptable way (compared to loading it into all backends
using shared_preload_libraries).
This extension could do a small amount of work on a portion of the
syscache entries at each query loop; still, I am wondering whether it
would not be nicer to have that in core and configurable, which is
basically the approach proposed by Horiguchi-san. At least it seems to
me that it has some merit, and if we could make that behavior
switchable, disabled by default, that would be a win for some classes
of applications. What do others think?
--
Michael
On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
The points of discussion are the following, I think.
1. The first patch seems to work well. It costs the time to scan
the whole of a catcache that has negative entries for other
reloids. However, such negative entries are created only by rather
unusual usage: accessing undefined columns, or accessing
columns on which no statistics have been created. The
whole-catcache scan occurs on ATTNAME, ATTNUM and
STATRELATTINH for every invalidation of a relcache entry.
I took a look at this. It looks sane, though I've got a few minor
comment tweaks:
+ * Remove negative cache tuples maching a partial key.
s/maching/matching/
+/* searching with a paritial key needs scanning the whole cache */
s/needs/means/
+ * a negative cache entry cannot be referenced so we can remove
s/referenced/referenced,/
I was wondering if there's a way to test the performance impact of
deleting negative entries.
2. The second patch also works, but flushing negative entries by
hash value is inefficient. It scans the bucket corresponding
to the given hash value for OIDs, then flushes negative entries
by iterating over all the collected OIDs. So this costs more time
than 1 and flushes entries that don't actually need to be
removed. If this feature is valuable but such side effects
are not acceptable, a new invalidation category based on a
cacheid-oid pair would be needed.
I glanced at this and it looks sane. Didn't go any farther since this
one's pretty up in the air. ISTM it'd be better to do some kind of aging
instead of patch 2.
The other (possibly naive) question I have is how useful negative
entries really are? Will Postgres regularly incur negative lookups, or
will these only happen due to user activity? I can't think of a case
where an app would need to depend on fast negative lookup (in other
words, it should be considered a bug in the app). I can see where
getting rid of them completely might be problematic, but maybe we can
just keep a relatively small number of them around. I'm thinking a
simple LRU list of X number of negative entries; when that fills you
reuse the oldest one. You'd have to pay the LRU maintenance cost on
every negative hit, but if those shouldn't be that common it shouldn't
be bad.
That might well necessitate another GUC, but it seems a lot simpler than
most of the other ideas.
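A sketch of that scheme, assuming a new dlist_node ('lru_elem') were
added to CatCTup (the list and field are invented; the dlist
primitives, my_cache, and CatCacheRemoveCTup() are real):

#define MAX_NEGATIVE_ENTRIES 1000    /* would presumably be the new GUC */

static dlist_head negative_lru = DLIST_STATIC_INIT(negative_lru);
static int  negative_count = 0;

/* called whenever a new negative entry is created */
static void
remember_negative_entry(CatCTup *ct)
{
    if (negative_count >= MAX_NEGATIVE_ENTRIES)
    {
        /* reuse the oldest: evict the least recently used negative entry */
        CatCTup    *victim = dlist_tail_element(CatCTup, lru_elem,
                                                &negative_lru);

        dlist_delete(&victim->lru_elem);
        CatCacheRemoveCTup(victim->my_cache, victim);
        negative_count--;
    }
    dlist_push_head(&negative_lru, &ct->lru_elem);
    negative_count++;
}

/* the maintenance cost paid on every negative-entry hit */
static void
touch_negative_entry(CatCTup *ct)
{
    dlist_move_head(&negative_lru, &ct->lru_elem);
}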
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
The other (possibly naive) question I have is how useful negative
entries really are? Will Postgres regularly incur negative lookups, or
will these only happen due to user activity?
It varies depending on the particular syscache, but in at least some
of them, negative cache entries are critical for performance.
See for example RelnameGetRelid(), which basically does a RELNAMENSP
cache lookup for each schema down the search path until it finds a
match. For any user table name with the standard search_path, there's
a guaranteed failure in pg_catalog before you can hope to find a match.
If we don't have negative cache entries, then *every invocation of this
function has to go to disk* (or at least to shared buffers).
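For reference, the loop is essentially this shape (a simplified
paraphrase of namespace.c, omitting error handling; activeSearchPath
is internal state there):

static Oid
RelnameGetRelid_sketch(const char *relname)
{
    ListCell   *l;

    foreach(l, activeSearchPath)    /* pg_catalog effectively comes first */
    {
        Oid         namespaceId = lfirst_oid(l);
        Oid         relid = get_relname_relid(relname, namespaceId);

        /*
         * For a user table, the RELNAMENSP lookup is guaranteed to miss
         * in pg_catalog; a negative entry turns that miss into a cheap
         * hash probe instead of a buffer (or disk) access every time.
         */
        if (OidIsValid(relid))
            return relid;
    }
    return InvalidOid;              /* not found in any schema in the path */
}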
It's possible that we could revise all our lookup patterns to avoid this
sort of thing. But I don't have much faith in that always being possible,
and exactly none that we won't introduce new lookup patterns that need it
in future. I spent some time, for instance, wondering if RelnameGetRelid
could use a SearchSysCacheList lookup instead, doing the lookup on table
name only and then inspecting the whole list to see which entry is
frontmost according to the current search path. But that has performance
failure modes of its own, for example if you have identical table names in
a boatload of different schemas. We do it that way for some other cases
such as function lookups, but I think it's much less likely that people
have identical function names in N schemas than that they have identical
table names in N schemas.
If you want to poke into this for particular test scenarios, building with
CATCACHE_STATS defined will yield a bunch of numbers dumped to the
postmaster log at each backend exit.
regards, tom lane
On 1/21/17 8:54 PM, Tom Lane wrote:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
The other (possibly naive) question I have is how useful negative
entries really are? Will Postgres regularly incur negative lookups, or
will these only happen due to user activity?
It varies depending on the particular syscache, but in at least some
of them, negative cache entries are critical for performance.
See for example RelnameGetRelid(), which basically does a RELNAMENSP
cache lookup for each schema down the search path until it finds a
match.
Ahh, I hadn't considered that. So one idea would be to only track
negative entries on caches where we know they're actually useful. That
might make the performance hit of some of the other ideas more
tolerable. Presumably you're much less likely to pollute the namespace
cache than some of the others.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On 1/22/17 4:41 PM, Jim Nasby wrote:
On 1/21/17 8:54 PM, Tom Lane wrote:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
The other (possibly naive) question I have is how useful negative
entries really are? Will Postgres regularly incur negative lookups, or
will these only happen due to user activity?
It varies depending on the particular syscache, but in at least some
of them, negative cache entries are critical for performance.
See for example RelnameGetRelid(), which basically does a RELNAMENSP
cache lookup for each schema down the search path until it finds a
match.
Ahh, I hadn't considered that. So one idea would be to only track
negative entries on caches where we know they're actually useful. That
might make the performance hit of some of the other ideas more
tolerable. Presumably you're much less likely to pollute the namespace
cache than some of the others.
Ok, after reading the code I see I only partly understood what you were
saying. In any case, it might still be useful to do some testing with
CATCACHE_STATS defined to see if there are caches that don't accumulate a
lot of negative entries.
Attached is a patch that tries to document some of this.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Attachments:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
Ahh, I hadn't considered that. So one idea would be to only track
negative entries on caches where we know they're actually useful. That
might make the performance hit of some of the other ideas more
tolerable. Presumably you're much less likely to pollute the namespace
cache than some of the others.
Ok, after reading the code I see I only partly understood what you were
saying. In any case, it might still be useful to do some testing with
CATCACHE_STATS defined to see if there are caches that don't accumulate a
lot of negative entries.
There definitely are, according to my testing, but by the same token
it's not clear that a shutoff check would save anything.
regards, tom lane
On 1/22/17 5:03 PM, Tom Lane wrote:
Ok, after reading the code I see I only partly understood what you were
saying. In any case, it might still be useful to do some testing with
CATCACHE_STATS defined to see if there are caches that don't accumulate a
lot of negative entries.
There definitely are, according to my testing, but by the same token
it's not clear that a shutoff check would save anything.
Currently they wouldn't, but there are concerns about the performance of
some of the other ideas in this thread. Getting rid of negative entries
that don't really help could reduce some of those concerns. Or perhaps
the original complaint about STATRELATTINH could be solved by just
disabling negative entries on that cache.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On 1/21/17 6:42 PM, Jim Nasby wrote:
On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
The points of discussion are the following, I think.
1. The first patch seems to work well. It costs the time to scan
the whole of a catcache that has negative entries for other
reloids. However, such negative entries are created only by rather
unusual usage: accessing undefined columns, or accessing
columns on which no statistics have been created. The
whole-catcache scan occurs on ATTNAME, ATTNUM and
STATRELATTINH for every invalidation of a relcache entry.
I took a look at this. It looks sane, though I've got a few minor
comment tweaks:
+ * Remove negative cache tuples maching a partial key.
s/maching/matching/
+/* searching with a paritial key needs scanning the whole cache */
s/needs/means/
+ * a negative cache entry cannot be referenced so we can remove
s/referenced/referenced,/
I was wondering if there's a way to test the performance impact of
deleting negative entries.
I did a make installcheck run with CATCACHE_STATS to see how often we
get negative entries in the 3 caches affected by this patch. The caches
on pg_attribute get almost no negative entries. pg_statistic gets a good
amount of negative entries, presumably because we start off with no
entries in there. On a stable system that presumably won't be an issue,
but if temporary tables are in use and being analyzed I'd think there
could be a moderate amount of inval traffic on that cache. I'll leave it
to a committer to decide if they think that's an issue, but you might
want to try and quantify how big a hit that is. I think it'd also be
useful to know how much bloat you were seeing in the field.
The patch is currently conflicting against master though, due to some
caches being added. Can you rebase? BTW, if you set a slightly larger
context size on the patch you might be able to avoid rebases; right now
the patch doesn't include enough context to uniquely identify the chunks
against cacheinfo[].
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Hello, thank you for looking at this.
At Mon, 23 Jan 2017 16:54:36 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <21803f50-a823-c444-ee2b-9a153114f454@BlueTreble.com>
On 1/21/17 6:42 PM, Jim Nasby wrote:
On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
The points of discussion are the following, I think.
1. The first patch seems to work well. It costs the time to scan
the whole of a catcache that has negative entries for other
reloids. However, such negative entries are created only by rather
unusual usage: accessing undefined columns, or accessing
columns on which no statistics have been created. The
whole-catcache scan occurs on ATTNAME, ATTNUM and
STATRELATTINH for every invalidation of a relcache entry.
I took a look at this. It looks sane, though I've got a few minor
comment tweaks:
+ * Remove negative cache tuples maching a partial key.
s/maching/matching/
+/* searching with a paritial key needs scanning the whole cache */
s/needs/means/
+ * a negative cache entry cannot be referenced so we can remove
s/referenced/referenced,/
I was wondering if there's a way to test the performance impact of
deleting negative entries.
Thanks for pointing those out. They are addressed.
I did a make installcheck run with CATCACHE_STATS to see how often we
get negative entries in the 3 caches affected by this patch. The
caches on pg_attribute get almost no negative entries. pg_statistic
gets a good amount of negative entries, presumably because we start
off with no entries in there. On a stable system that presumably won't
be an issue, but if temporary tables are in use and being analyzed I'd
think there could be a moderate amount of inval traffic on that
cache. I'll leave it to a committer to decide if they think that's an
issue, but you might want to try and quantify how big a hit that is. I
think it'd also be useful to know how much bloat you were seeing in
the field.
The patch is currently conflicting against master though, due to some
caches being added. Can you rebase?
Six new syscaches in 665d1fa conflicted, but a 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
BTW, if you set a slightly larger
context size on the patch you might be able to avoid rebases; right
now the patch doesn't include enough context to uniquely identify the
chunks against cacheinfo[].
git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
not sure that such a hunk can avoid rebases. Is this what you
suggested? -U4 added an identifiable forward context line for
some elements so the attached patch is made with four context
lines.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hello,
I have tried capping the number of negative entries myself (by
removing negative entries in least-recently-created-first order),
but the ceilings cannot be reasonably determined, either absolutely
or relative to the number of positive entries. Apparently it differs
widely among caches and applications.
At Mon, 23 Jan 2017 08:16:49 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <6519b7ad-0aa6-c9f4-8869-20691107fb69@BlueTreble.com>
On 1/22/17 5:03 PM, Tom Lane wrote:
Ok, after reading the code I see I only partly understood what you were
saying. In any case, it might still be useful to do some testing with
CATCACHE_STATS defined to see if there are caches that don't accumulate a
lot of negative entries.
There definitely are, according to my testing, but by the same token
it's not clear that a shutoff check would save anything.
Currently they wouldn't, but there are concerns about the performance of
some of the other ideas in this thread. Getting rid of negative
entries that don't really help could reduce some of those concerns. Or
perhaps the original complaint about STATRELATTINH could be solved by
just disabling negative entries on that cache.
As for STATRELATTINH, planning involving small temporary tables
that are frequently accessed will benefit from negative entries,
but the benefit might be ignorably small. ATTNAME, ATTNUM and
RELNAMENSP also might not get much from negative entries. If these
assumptions are true, the whole machinery this patch adds could be
replaced with just a boolean in cachedesc that inhibits negative
entries. Anyway, this patch doesn't address the case of cache bloat
related to function references. I'm not sure how that could be
reproduced, though.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Six new syscaches in 665d1fa conflicted, but a 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
To be honest, I am not completely sure what to think about this patch.
Moved to the next CF, as there is a new version but no new reviews;
perhaps the discussion will move on there.
--
Michael
Hello, thank you for moving this to the next CF.
At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Six new syscaches in 665d1fa conflicted, but a 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
To be honest, I am not completely sure what to think about this patch.
Moved to the next CF, as there is a new version but no new reviews;
perhaps the discussion will move on there.
I think the following is the status of this topic.
- The patch still applies without conflicts.
- This is not a holistic measure against memory leaks, but it surely
saves some existing cases.
- A shared catcache is a separate discussion (and won't really be
proposed soon, due to the locking issue).
- As I mentioned, a patch that caps the number of negative entries
is available (working in first-created, first-deleted manner), but
it has a loose end: how to determine the limit.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
Hello, thank you for moving this to the next CF.
At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Six new syscaches in 665d1fa conflicted, but a 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
To be honest, I am not completely sure what to think about this patch.
Moved to the next CF, as there is a new version but no new reviews;
perhaps the discussion will move on there.
I think the following is the status of this topic.
- The patch still applies without conflicts.
- This is not a holistic measure against memory leaks, but it surely
saves some existing cases.
- A shared catcache is a separate discussion (and won't really be
proposed soon, due to the locking issue).
- As I mentioned, a patch that caps the number of negative entries
is available (working in first-created, first-deleted manner), but
it has a loose end: how to determine the limit.
While preventing bloat in the syscache is a worthwhile goal, it appears
there are a number of loose ends here and a new patch has not been provided.
It's a pretty major change so I recommend moving this patch to the
2017-07 CF.
--
-David
david@pgmasters.net
On 3/3/17 4:54 PM, David Steele wrote:
On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
Hello, thank you for moving this to the next CF.
At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Six new syscaches in 665d1fa conflicted, but a 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
To be honest, I am not completely sure what to think about this patch.
Moved to the next CF, as there is a new version but no new reviews;
perhaps the discussion will move on there.
I think the following is the status of this topic.
- The patch still applies without conflicts.
- This is not a holistic measure against memory leaks, but it surely
saves some existing cases.
- A shared catcache is a separate discussion (and won't really be
proposed soon, due to the locking issue).
- As I mentioned, a patch that caps the number of negative entries
is available (working in first-created, first-deleted manner), but
it has a loose end: how to determine the limit.
While preventing bloat in the syscache is a worthwhile goal, it appears
there are a number of loose ends here and a new patch has not been provided.
It's a pretty major change so I recommend moving this patch to the
2017-07 CF.
Not hearing any opinions pro or con, I'm moving this patch to the
2017-07 CF.
--
-David
david@pgmasters.net
At Tue, 7 Mar 2017 19:23:14 -0800, David Steele <david@pgmasters.net> wrote in <3b7b7f90-db46-8c37-c4f7-443330c3ae33@pgmasters.net>
On 3/3/17 4:54 PM, David Steele wrote:
On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
Hello, thank you for moving this to the next CF.
At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier
<michael.paquier@gmail.com> wrote in
<CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Six new syscaches in 665d1fa conflicted, but a 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
To be honest, I am not completely sure what to think about this patch.
Moved to the next CF, as there is a new version but no new reviews;
perhaps the discussion will move on there.
I think the following is the status of this topic.
- The patch still applies without conflicts.
- This is not a holistic measure against memory leaks, but it surely
saves some existing cases.
- A shared catcache is a separate discussion (and won't really be
proposed soon, due to the locking issue).
- As I mentioned, a patch that caps the number of negative entries
is available (working in first-created, first-deleted manner), but
it has a loose end: how to determine the limit.
While preventing bloat in the syscache is a worthwhile goal, it appears
there are a number of loose ends here and a new patch has not been provided.
It's a pretty major change so I recommend moving this patch to the
2017-07 CF.
Not hearing any opinions pro or con, I'm moving this patch to the
2017-07 CF.
Ah. Yes, I agree on this. Thanks.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 1/24/17 02:58, Kyotaro HORIGUCHI wrote:
BTW, if you set a slightly larger
context size on the patch you might be able to avoid rebases; right
now the patch doesn't include enough context to uniquely identify the
chunks against cacheinfo[].
git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
not sure that such a hunk can avoid rebases. Is this what you
suggested? -U4 added an identifiable forward context line for
some elements so the attached patch is made with four context
lines.
This patch needs another rebase for the upcoming commit fest.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for your attention.
At Mon, 14 Aug 2017 17:33:48 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <09fa011f-4536-b05d-0625-11f3625d8332@2ndquadrant.com>
On 1/24/17 02:58, Kyotaro HORIGUCHI wrote:
BTW, if you set a slightly larger
context size on the patch you might be able to avoid rebases; right
now the patch doesn't include enough context to uniquely identify the
chunks against cacheinfo[].
git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
not sure that such a hunk can avoid rebases. Is this what you
suggested? -U4 added an identifiable forward context line for
some elements so the attached patch is made with four context
lines.
This patch needs another rebase for the upcoming commit fest.
This patch has had interference from several commits since the
last submission. I amended it to follow them (up to
f97c55c), removed an unnecessary branch, and edited some comments.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This patch has had interference from several commits since the
last submission. I amended it to follow them (up to
f97c55c), removed an unnecessary branch, and edited some comments.
I think the core problem for this patch is that there's no consensus
on what approach to take. Until that somehow gets sorted out, I think
this isn't going to make any progress. Unfortunately, I don't have a
clear idea what sort of solution everybody could tolerate.
I still think that some kind of slow-expire behavior -- like a clock
hand that hits each backend every 10 minutes and expires entries not
used since the last hit -- is actually pretty sensible. It ensures
that idle or long-running backends don't accumulate infinite bloat
while still allowing the cache to grow large enough for good
performance when all entries are being regularly used. But Tom
doesn't like it. Other approaches were also discussed; none of them
seem like an obvious slam-dunk.
Turning to the patch itself, I don't know how we decide whether the
patch is worth it. Scanning the whole (potentially large) cache to
remove negative entries has a cost, mostly in CPU cycles; keeping
those negative entries around for a long time also has a cost, mostly
in memory. I don't know how to decide whether these patches will help
more people than it hurts, or the other way around -- and it's not
clear that anyone else has a good idea about that either.
Typos: funciton, paritial.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This patch has had interference from several commits since the
last submission. I amended it to follow them (up to
f97c55c), removed an unnecessary branch, and edited some comments.
Hi Kyotaro-san,
This applies but several regression tests fail for me. Here is a
sample backtrace:
frame #3: 0x000000010f0614c0
postgres`ExceptionalCondition(conditionName="!(attnum < 0 ? attnum ==
(-2) : cache->cc_tupdesc->attrs[attnum].atttypid == 26)",
errorType="FailedAssertion", fileName="catcache.c", lineNumber=1384) +
128 at assert.c:54
frame #4: 0x000000010f03b5fd
postgres`CollectOIDsForHashValue(cache=0x00007fe273821268,
hashValue=994410284, attnum=0) + 253 at catcache.c:1383
frame #5: 0x000000010f055e8e
postgres`SysCacheSysCacheInvalCallback(arg=140610577303984, cacheid=0,
hashValue=994410284) + 94 at syscache.c:1692
frame #6: 0x000000010f03fbbb
postgres`CallSyscacheCallbacks(cacheid=0, hashvalue=994410284) + 219
at inval.c:1468
frame #7: 0x000000010f03f878
postgres`LocalExecuteInvalidationMessage(msg=0x00007fff51213ff8) + 88
at inval.c:566
frame #8: 0x000000010ee7a3f2
postgres`ReceiveSharedInvalidMessages(invalFunction=(postgres`LocalExecuteInvalidationMessage
at inval.c:555), resetFunction=(postgres`InvalidateSystemCaches at
inval.c:647)) + 354 at sinval.c:121
frame #9: 0x000000010f03fcb7 postgres`AcceptInvalidationMessages +
23 at inval.c:686
frame #10: 0x000000010eade609 postgres`AtStart_Cache + 9 at xact.c:987
frame #11: 0x000000010ead8c2f postgres`StartTransaction + 655 at xact.c:1921
frame #12: 0x000000010ead8896 postgres`StartTransactionCommand +
70 at xact.c:2691
frame #13: 0x000000010eea9746 postgres`start_xact_command + 22 at
postgres.c:2438
frame #14: 0x000000010eea722e
postgres`exec_simple_query(query_string="RESET SESSION
AUTHORIZATION;") + 126 at postgres.c:913
frame #15: 0x000000010eea68d7 postgres`PostgresMain(argc=1,
argv=0x00007fe2738036a8, dbname="regression", username="munro") + 2375
at postgres.c:4090
frame #16: 0x000000010eded40e
postgres`BackendRun(port=0x00007fe2716001a0) + 654 at
postmaster.c:4357
frame #17: 0x000000010edec793
postgres`BackendStartup(port=0x00007fe2716001a0) + 483 at
postmaster.c:4029
frame #18: 0x000000010edeb785 postgres`ServerLoop + 597 at postmaster.c:1753
frame #19: 0x000000010ede8f71 postgres`PostmasterMain(argc=8,
argv=0x00007fe271403860) + 5553 at postmaster.c:1361
frame #20: 0x000000010ed0ccd9 postgres`main(argc=8,
argv=0x00007fe271403860) + 761 at main.c:228
frame #21: 0x00007fff8333a5ad libdyld.dylib`start + 1
--
Thomas Munro
http://www.enterprisedb.com
Thank you for reviewing this.
At Sat, 2 Sep 2017 12:12:47 +1200, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=3wqPFFSKP_yhkuHLZtOOwZskGuHJdSctVnbHQ4DFEH+Q@mail.gmail.com>
On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This patch has had interference from several commits since the
last submission. I amended it to follow them (up to
f97c55c), removed an unnecessary branch, and edited some comments.
Hi Kyotaro-san,
This applies but several regression tests fail for me. Here is a
sample backtrace:
Sorry for the silly mistake. STAEXTNAMENSP and STATRELATTINH were
missing additional elements in their definitions. Somehow I had
removed them.
The attached patch passes the regression tests without crashing. I
also fixed some typos pointed out by Robert and others I found myself.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Thank you for the comment.
At Mon, 28 Aug 2017 21:31:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com>
On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This patch has had interference from several commits since the
last submission. I amended it to follow them (up to
f97c55c), removed an unnecessary branch, and edited some comments.
I think the core problem for this patch is that there's no consensus
on what approach to take. Until that somehow gets sorted out, I think
this isn't going to make any progress. Unfortunately, I don't have a
clear idea what sort of solution everybody could tolerate.
I still think that some kind of slow-expire behavior -- like a clock
hand that hits each backend every 10 minutes and expires entries not
used since the last hit -- is actually pretty sensible. It ensures
that idle or long-running backends don't accumulate infinite bloat
while still allowing the cache to grow large enough for good
performance when all entries are being regularly used. But Tom
doesn't like it. Other approaches were also discussed; none of them
seem like an obvious slam-dunk.
I suppose that it would slow intermittent lookups of non-existent
objects. I tried a slightly different thing: removing entries by
'age', preserving a specified number (or ratio relative to live
entries) of younger negative entries. The problem with that
approach was that I couldn't find how to determine the number of
entries to preserve, and I didn't want to offer additional knobs
for it. So I finally proposed the patch upthread, since it doesn't
need any assumptions about usage.
Though I could make another patch that does the same thing based on
LRU, the same how-many-to-preserve problem would have to be resolved
in order to avoid slowing intermittent lookups.
Turning to the patch itself, I don't know how we decide whether the
patch is worth it. Scanning the whole (potentially large) cache to
remove negative entries has a cost, mostly in CPU cycles; keeping
those negative entries around for a long time also has a cost, mostly
in memory. I don't know how to decide whether these patches will help
more people than they hurt, or the other way around -- and it's not
clear that anyone else has a good idea about that either.
Scanning a hash on invalidation of several catalogs slows (hopefully
only slightly) a certain percentage of invalidations on most
workloads. Holding entries that are no longer looked up will surely
kill a backend under certain workloads sooner or later. This doesn't
cover the pg_proc cases, but it does cover the pg_statistic and
pg_class cases. I'm not sure what other catalogs can bloat.
I could reduce the complexity of this. The invalidation mechanism
conveys only a hash value, so this scans the whole of a cache for
the target OIDs (with possible spurious targets). That could be
resolved by letting the invalidation mechanism convey an OID (but
this may need additional members in an invalidation entry).
Still, the full scan performed in CleanupCatCacheNegEntries
doesn't seem easily avoidable. Separating the hash by key OID, or
providing a special dlist that points at the tuples in buckets,
would introduce more complexity.
Typos: funciton, paritial.
Thanks. ispell told me of additional typos: corresnpond, belive
and undistinguisable.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
This is a rebased version of the patch.
At Fri, 17 Mar 2017 14:23:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170317.142313.232290068.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 7 Mar 2017 19:23:14 -0800, David Steele <david@pgmasters.net> wrote in <3b7b7f90-db46-8c37-c4f7-443330c3ae33@pgmasters.net>
On 3/3/17 4:54 PM, David Steele wrote:
On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
Hello, thank you for moving this to the next CF.
At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier
<michael.paquier@gmail.com> wrote in
<CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Six new syscaches in 665d1fa conflicted, and the 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.
To be honest, I am not completely sure what to think about this patch.
Moved to next CF as there is a new version, and no new reviews to make
the discussion perhaps move on.
I think the following is the status of this topic.
- The patch still applies without conflicts.
- This is not a holistic measure against memory leaks, but it surely
covers some existing cases.
- Shared catcache is another discussion (and won't really be
proposed in a short time due to the issue of locking).
- As I mentioned, a patch that caps the number of negative
entries is available (in a first-created, first-deleted manner),
but it has a loose end: how to determine the limit.
While preventing bloat in the syscache is a worthwhile goal, it
appears there are a number of loose ends here and a new patch has not
been provided.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This is a rebased version of the patch.
As far as I can see, the patch still applies, compiles, and got no
reviews. So moved to next CF.
--
Michael
On Wed, Nov 29, 2017 at 8:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This is a rebased version of the patch.
As far as I can see, the patch still applies, compiles, and got no
reviews. So moved to next CF.
I think we have to mark this as returned with feedback or rejected for
the reasons mentioned here:
/messages/by-id/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 30, 2017 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 29, 2017 at 8:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This is a rebased version of the patch.
As far as I can see, the patch still applies, compiles, and got no
reviews. So moved to next CF.
I think we have to mark this as returned with feedback or rejected for
the reasons mentioned here:
/messages/by-id/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com
Good point. I forgot this bit. Thanks for mentioning it. I am switching
the patch to returned with feedback.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
On Thu, Nov 30, 2017 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think we have to mark this as returned with feedback or rejected for
the reasons mentioned here:
/messages/by-id/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com
Good point. I forgot this bit. Thanks for mentioning it. I am switching
the patch to returned with feedback.
We had a bug report just today that seemed to me to trace to relcache
bloat:
/messages/by-id/20171129100649.1473.73990@wrigleys.postgresql.org
ISTM that there's definitely work to be done here, but as I said upthread,
I think we need a more holistic approach than just focusing on negative
catcache entries, or even just catcache entries.
The thing that makes me uncomfortable about this is that we used to have a
catcache size limitation mechanism, and ripped it out because it had too
much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that
problem within a fresh implementation.
regards, tom lane
On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The thing that makes me uncomfortable about this is that we used to have a
catcache size limitation mechanism, and ripped it out because it had too
much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that
problem within a fresh implementation.
At the risk of beating a dead horse, I still think that the amount of
wall clock time that has elapsed since an entry was last accessed is
very relevant. The problem with a fixed maximum size is that you can
hit it arbitrarily frequently; time-based expiration solves that
problem. It allows backends that are actively using a lot of stuff to
hold on to as many cache entries as they need, while forcing backends
that have moved on to a different set of tables -- or that are
completely idle -- to let go of cache entries that are no longer being
actively used. I think that's what we want. Nobody wants to keep the
cache size small when a big cache is necessary for good performance,
but what people do want to avoid is having long-running backends
eventually accumulate huge numbers of cache entries most of which
haven't been touched in hours or, maybe, weeks.
To put that another way, we should only hang on to a cache entry for
so long as the bytes of memory that it consumes are more valuable than
some other possible use of those bytes of memory. That is very likely
to be true when we've accessed those bytes recently, but progressively
less likely to be true the more time has passed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The thing that makes me uncomfortable about this is that we used to have a
catcache size limitation mechanism, and ripped it out because it had too
much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that
problem within a fresh implementation.
At the risk of beating a dead horse, I still think that the amount of
wall clock time that has elapsed since an entry was last accessed is
very relevant.
While I don't object to that statement, I'm not sure how it helps us
here. If we couldn't afford DLMoveToFront(), doing a gettimeofday()
during each syscache access is surely right out.
regards, tom lane
On Thu, Nov 30, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The thing that makes me uncomfortable about this is that we used to have a
catcache size limitation mechanism, and ripped it out because it had too
much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that
problem within a fresh implementation.
At the risk of beating a dead horse, I still think that the amount of
wall clock time that has elapsed since an entry was last accessed is
very relevant.
While I don't object to that statement, I'm not sure how it helps us
here. If we couldn't afford DLMoveToFront(), doing a gettimeofday()
during each syscache access is surely right out.
Well, yeah, that would be insane. But I think even something very
rough could work well enough. I think our goal should be to eliminate
cache entries that have gone unused for many *minutes*, and
there's no urgency about getting it to any sort of exact value. For
non-idle backends, using the most recent statement start time as a
proxy would probably be plenty good enough. Idle backends might need
a bit more thought.
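As an illustration only, that proxy could be as cheap as copying a
clock variable that is refreshed once per statement, so the cache
access path never touches gettimeofday(). A minimal sketch, with
invented helper names and an assumed lastaccess field on the entry:

/* Refreshed once per statement, not once per cache access. */
static TimestampTz catcache_clock;

/* Hypothetical: call this from statement-start processing. */
void
RefreshCatCacheClock(void)
{
	catcache_clock = GetCurrentTimestamp();
}

/* Hypothetical: call this wherever a catcache entry is returned. */
static inline void
TouchCatCacheEntry(CatCTup *ct)
{
	/* O(1), no syscall: just copy the cached timestamp */
	ct->lastaccess = catcache_clock;
}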
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
Well, yeah, that would be insane. But I think even something very
rough could work well enough. I think our goal should be to eliminate
cache entries that have gone unused for many *minutes*, and
there's no urgency about getting it to any sort of exact value. For
non-idle backends, using the most recent statement start time as a
proxy would probably be plenty good enough. Idle backends might need
a bit more thought.
Our timer framework is flexible enough that we can install a
once-a-minute timer without much overhead. That timer could increment a
'cache generation' integer. Upon cache access we write the current
generation into relcache / syscache (and potentially also plancache?)
entries. Not entirely free, but cheap enough. In those once-a-minute
passes entries that haven't been touched in X cycles get pruned.
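A minimal sketch of that scheme, for illustration only (the
generation field and every function name here are hypothetical; the
bucket walk mirrors the dlist idiom used in catcache.c):

/* Bumped by a periodic timer, e.g. once a minute. */
static uint32 cache_generation = 0;

/* Hypothetical timer callback: the housekeeping tick. */
static void
CacheHousekeepingTick(void)
{
	cache_generation++;
}

/* Stamp an entry on every cache access. */
static inline void
StampCatCacheEntry(CatCTup *ct)
{
	ct->generation = cache_generation;	/* assumed new field */
}

/* Prune entries untouched for more than max_age ticks. */
static void
PruneByGeneration(CatCache *cp, uint32 max_age)
{
	for (int i = 0; i < cp->cc_nbuckets; i++)
	{
		dlist_mutable_iter iter;

		dlist_foreach_modify(iter, &cp->cc_bucket[i])
		{
			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

			if (ct->refcount == 0 &&
				cache_generation - ct->generation > max_age)
				CatCacheRemoveCTup(cp, ct);
		}
	}
}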
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
Well, yeah, that would be insane. But I think even something very
rough could work well enough. I think our goal should be to eliminate
cache entries that have gone unused for many *minutes*, and
there's no urgency about getting it to any sort of exact value. For
non-idle backends, using the most recent statement start time as a
proxy would probably be plenty good enough. Idle backends might need
a bit more thought.
Our timer framework is flexible enough that we can install a
once-a-minute timer without much overhead. That timer could increment a
'cache generation' integer. Upon cache access we write the current
generation into relcache / syscache (and potentially also plancache?)
entries. Not entirely free, but cheap enough. In those once-a-minute
passes entries that haven't been touched in X cycles get pruned.
I have no faith in either of these proposals, because they both assume
that the problem only arises over the course of many minutes. In the
recent complaint about pg_dump causing relcache bloat, it probably does
not take nearly that long for the bloat to occur.
Maybe you could make it work on the basis of number of cache accesses,
or some other normalized-to-workload-not-wall-clock time reference.
regards, tom lane
On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
Well, yeah, that would be insane. But I think even something very
rough could work well enough. I think our goal should be to eliminate
cache entries that have gone unused for many *minutes*, and
there's no urgency about getting it to any sort of exact value. For
non-idle backends, using the most recent statement start time as a
proxy would probably be plenty good enough. Idle backends might need
a bit more thought.
Our timer framework is flexible enough that we can install a
once-a-minute timer without much overhead. That timer could increment a
'cache generation' integer. Upon cache access we write the current
generation into relcache / syscache (and potentially also plancache?)
entries. Not entirely free, but cheap enough. In those once-a-minute
passes entries that haven't been touched in X cycles get pruned.
I have no faith in either of these proposals, because they both assume
that the problem only arises over the course of many minutes. In the
recent complaint about pg_dump causing relcache bloat, it probably does
not take nearly that long for the bloat to occur.
To me that's a bit of a different problem than what I was discussing
here. It also actually doesn't seem that hard - if your caches are
growing fast, you'll continually get hash-resizing of the various
caches. Adding cache-pruning to the resizing code doesn't seem hard,
and wouldn't add meaningful overhead.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
I have no faith in either of these proposals, because they both assume
that the problem only arises over the course of many minutes. In the
recent complaint about pg_dump causing relcache bloat, it probably does
not take nearly that long for the bloat to occur.
To me that's a bit of a different problem than what I was discussing
here. It also actually doesn't seem that hard - if your caches are
growing fast, you'll continually get hash-resizing of the various
caches. Adding cache-pruning to the resizing code doesn't seem hard,
and wouldn't add meaningful overhead.
That's an interesting way to think about it, as well, though I'm not
sure it's quite that simple. If you tie this to cache resizing then
the cache will have to grow up to the newly increased size before
you'll prune it again. That doesn't sound like it will lead to nice
steady-state behavior.
regards, tom lane
On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
I have no faith in either of these proposals, because they both assume
that the problem only arises over the course of many minutes. In the
recent complaint about pg_dump causing relcache bloat, it probably does
not take nearly that long for the bloat to occur.
To me that's a bit of a different problem than what I was discussing
here. It also actually doesn't seem that hard - if your caches are
growing fast, you'll continually get hash-resizing of the various
caches. Adding cache-pruning to the resizing code doesn't seem hard,
and wouldn't add meaningful overhead.
That's an interesting way to think about it, as well, though I'm not
sure it's quite that simple. If you tie this to cache resizing then
the cache will have to grow up to the newly increased size before
you'll prune it again. That doesn't sound like it will lead to nice
steady-state behavior.
Yea, it's not perfect - but if we do pruning both at resize *and* on
regular intervals, like once-a-minute as I was suggesting, I don't think
it's that bad. The steady state won't be reached within seconds, true,
but the negative consequences of only attempting to shrink the cache
upon resizing when the cache size is growing fast anyway don't seem
that large.
I don't think we need to be super accurate here, there just needs to be
*some* backpressure.
I've had cases in the past where just occasionally blasting the cache
away would've been good enough.
Greetings,
Andres Freund
At Fri, 1 Dec 2017 14:12:20 -0800, Andres Freund <andres@anarazel.de> wrote in <20171201221220.z5e6wtlpl264wzik@alap3.anarazel.de>
On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
I have no faith in either of these proposals, because they both assume
that the problem only arises over the course of many minutes. In the
recent complaint about pg_dump causing relcache bloat, it probably does
not take nearly that long for the bloat to occur.
To me that's a bit of a different problem than what I was discussing
here. It also actually doesn't seem that hard - if your caches are
growing fast, you'll continually get hash-resizing of the various
caches. Adding cache-pruning to the resizing code doesn't seem hard,
and wouldn't add meaningful overhead.
That's an interesting way to think about it, as well, though I'm not
sure it's quite that simple. If you tie this to cache resizing then
the cache will have to grow up to the newly increased size before
you'll prune it again. That doesn't sound like it will lead to nice
steady-state behavior.
Yea, it's not perfect - but if we do pruning both at resize *and* on
regular intervals, like once-a-minute as I was suggesting, I don't think
it's that bad. The steady state won't be reached within seconds, true,
but the negative consequences of only attempting to shrink the cache
upon resizing when the cache size is growing fast anyway don't seem
that large.
I don't think we need to be super accurate here, there just needs to be
*some* backpressure.
I've had cases in the past where just occasionally blasting the cache
away would've been good enough.
Thank you very much for the valuable suggestions. I still would
like to solve this problem, and a counter freely running at minute
(or several-second) resolution, combined with pruning entries that
have gone too long unaccessed at resize time, seems to me to work
well enough for at least several known bloat cases. This still has
a defect in that it doesn't handle very quick bloating. I'll try
thinking about that remaining issue.
If no one has an immediate objection to this direction, I'll come up
with an implementation.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Dec 13, 2017 at 11:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Thank you very much for the valuable suggestions. I still would
like to solve this problem, and a counter freely running at minute
(or several-second) resolution, combined with pruning entries that
have gone too long unaccessed at resize time, seems to me to work
well enough for at least several known bloat cases. This still has
a defect in that it doesn't handle very quick bloating. I'll try
thinking about that remaining issue.
I'm not sure we should regard very quick bloating as a problem in need
of solving. Doesn't that just mean we need the cache to be bigger, at
least temporarily?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-12-16 22:25:48 -0500, Robert Haas wrote:
On Wed, Dec 13, 2017 at 11:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Thank you very much for the valuable suggestions. I still would
like to solve this problem, and a counter freely running at minute
(or several-second) resolution, combined with pruning entries that
have gone too long unaccessed at resize time, seems to me to work
well enough for at least several known bloat cases. This still has
a defect in that it doesn't handle very quick bloating. I'll try
thinking about that remaining issue.
I'm not sure we should regard very quick bloating as a problem in need
of solving. Doesn't that just mean we need the cache to be bigger, at
least temporarily?
Leaving that aside, is that not actually solved, at least to a good
degree, by that proposal? By bumping the generation on hash resize, we
have recency information we can take into account.
Greetings,
Andres Freund
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund <andres@anarazel.de> wrote:
I'm not sure we should regard very quick bloating as a problem in need
of solving. Doesn't that just mean we need the cache to be bigger, at
least temporarily?
Leaving that aside, is that not actually solved, at least to a good
degree, by that proposal? By bumping the generation on hash resize, we
have recency information we can take into account.
I agree that we can do it. I'm just not totally sure it's a good
idea. I'm also not totally sure it's a bad idea, either. That's why
I asked the question.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-12-17 19:23:45 -0500, Robert Haas wrote:
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund <andres@anarazel.de> wrote:
I'm not sure we should regard very quick bloating as a problem in need
of solving. Doesn't that just mean we need the cache to be bigger, at
least temporarily?
Leaving that aside, is that not actually solved, at least to a good
degree, by that proposal? By bumping the generation on hash resize, we
have recency information we can take into account.
I agree that we can do it. I'm just not totally sure it's a good
idea. I'm also not totally sure it's a bad idea, either. That's why
I asked the question.
I'm not 100% convinced either - but I also don't think it matters all
that terribly much. As long as the overall hash hit rate is decent,
minor increases in the absolute number of misses don't really matter
that much for syscache imo. I'd personally go for something like:
1) When about to resize, check if there are entries of generation -2
around.
Don't resize if more than 15% of entries could be freed. Also, stop
reclaiming at that threshold, to avoid unnecessarily purging cache
entries.
Using two generations allows a bit more time for cache entries to
be marked as fresh before the next resize.
2) While resizing, increment the generation count by one.
3) Once a minute, increment the generation count by one.
The one thing I don't quite have a good handle on is how much, if
any, cache reclamation to do at 3). We don't really want to throw away
all the caches just because a connection has been idle for a few
minutes; in a connection pool that can happen occasionally. I think I'd
for now *not* do any reclamation except at resize boundaries.
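For illustration, points 1) and 2) might fit together like the
following sketch. The function name is invented, the 15% figure is
taken from the description above, and the exact stop-at-threshold
rule is a guess; cache_generation and ct->generation are the
hypothetical counter and field from the earlier sketch:

static bool
TryReclaimInsteadOfResize(CatCache *cp)
{
	int			ntup_before = cp->cc_ntup;
	int			ncandidates = 0;
	int			nfreed = 0;
	int			i;

	/* Pass 1: count entries at least two generations stale. */
	for (i = 0; i < cp->cc_nbuckets; i++)
	{
		dlist_iter	iter;

		dlist_foreach(iter, &cp->cc_bucket[i])
		{
			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

			if (cache_generation - ct->generation >= 2)
				ncandidates++;
		}
	}

	/* 1) If no more than 15% of entries could be freed, resize instead. */
	if (ncandidates * 100 <= ntup_before * 15)
		return false;

	/* Evict stale entries, but stop reclaiming at the same threshold. */
	for (i = 0; i < cp->cc_nbuckets && nfreed * 100 < ntup_before * 15; i++)
	{
		dlist_mutable_iter iter;

		dlist_foreach_modify(iter, &cp->cc_bucket[i])
		{
			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

			if (ct->refcount == 0 &&
				cache_generation - ct->generation >= 2)
			{
				CatCacheRemoveCTup(cp, ct);
				nfreed++;
			}
		}
	}

	cache_generation++;			/* 2) bump the generation at resize time */
	return true;
}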
Greetings,
Andres Freund
On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund <andres@anarazel.de> wrote:
I'm not 100% convinced either - but I also don't think it matters all
that terribly much. As long as the overall hash hit rate is decent,
minor increases in the absolute number of misses don't really matter
that much for syscache imo. I'd personally go for something like:
1) When about to resize, check if there are entries of generation -2
around.
Don't resize if more than 15% of entries could be freed. Also, stop
reclaiming at that threshold, to avoid unnecessarily purging cache
entries.
Using two generations allows a bit more time for cache entries to
be marked as fresh before the next resize.
2) While resizing, increment the generation count by one.
3) Once a minute, increment the generation count by one.
The one thing I don't quite have a good handle on is how much, if
any, cache reclamation to do at 3). We don't really want to throw away
all the caches just because a connection has been idle for a few
minutes; in a connection pool that can happen occasionally. I think I'd
for now *not* do any reclamation except at resize boundaries.
My starting inclination was almost the opposite. I think that you
might be right that a minute or two of idle time isn't sufficient
reason to flush our local cache, but I'd be inclined to fix that by
incrementing the generation count every 10 minutes or so rather than
every minute, and still flush things more than 1 generation old. The
reason for that is that I think we should ensure that the system
doesn't sit there idle forever with a giant cache. If it's not using
those cache entries, I'd rather have it discard them and rebuild the
cache when it becomes active again.
Now, I also see your point about trying to clean up before
resizing. That does seem like a good idea, although we have to be
careful not to be too eager to clean up there, or we'll just end up
artificially limiting the cache size when it's unwise to do so. But I
guess that's what you meant by "Also, stop reclaiming at that
threshold, to avoid unnecessarily purging cache entries." I think the
idea you are proposing is that:
1. The first time we are due to expand the hash table, we check
whether we can forestall that expansion by doing a cleanup; if so, we
do that instead.
2. After that, we just expand.
That seems like a fairly good idea, although it might be a better idea
to allow cleanup if enough time has passed. If we hit the expansion
threshold twice an hour apart, there's no reason not to try cleanup
again.
Generally, the way I'm viewing this is that a syscache entry means
paying memory to save CPU time. Each 8kB of memory we use to store
system cache entries is one less block we have for the OS page cache
to hold onto our data blocks. If we had an oracle (the kind from
Delphi, not Redwood City) that told us with perfect accuracy when to
discard syscache entries, it would throw away syscache entries
whenever the marginal execution-time performance we could buy from
another 8kB in the page cache is greater than the marginal
execution-time performance we could buy from those syscache entries.
In reality, it's hard to know which of those things is of greater
value. If the system isn't meaningfully memory-constrained, we ought
to just always hang onto the syscache entries, as we do today, but
it's hard to know that. I think the place where this really becomes a
problem is on systems with hundreds of connections + thousands of
tables + connection pooling; without some back-pressure, every backend
eventually caches everything, putting the system under severe memory
pressure for basically no performance gain. Each new use of the
connection is probably for a limited set of tables, and only those
tables really need syscache entries; holding onto things used long in the
past doesn't save enough to justify the memory used.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Mon, 18 Dec 2017 12:14:24 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaWLBzUasvVs-q=dfBr3pLWSUCQnbqLk-MT7iX4eyrinA@mail.gmail.com>
On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund <andres@anarazel.de> wrote:
I'm not 100% convinced either - but I also don't think it matters all
that terribly much. As long as the overall hash hit rate is decent,
minor increases in the absolute number of misses don't really matter
that much for syscache imo. I'd personally go for something like:
1) When about to resize, check if there are entries of generation -2
around.
Don't resize if more than 15% of entries could be freed. Also, stop
reclaiming at that threshold, to avoid unnecessarily purging cache
entries.
Using two generations allows a bit more time for cache entries to
be marked as fresh before the next resize.
2) While resizing, increment the generation count by one.
3) Once a minute, increment the generation count by one.
The one thing I don't quite have a good handle on is how much, if
any, cache reclamation to do at 3). We don't really want to throw away
all the caches just because a connection has been idle for a few
minutes; in a connection pool that can happen occasionally. I think I'd
for now *not* do any reclamation except at resize boundaries.
My starting inclination was almost the opposite. I think that you
might be right that a minute or two of idle time isn't sufficient
reason to flush our local cache, but I'd be inclined to fix that by
incrementing the generation count every 10 minutes or so rather than
every minute, and still flush things more than 1 generation old. The
reason for that is that I think we should ensure that the system
doesn't sit there idle forever with a giant cache. If it's not using
those cache entries, I'd rather have it discard them and rebuild the
cache when it becomes active again.
I see three kinds of syscache entries.
A. An entry for an actually existing object.
This is literally a syscache entry. This kind of entry doesn't
have to be removed, but it can be removed after being ignored for
a certain period of time.
B. An entry for an object which once existed but no longer does.
This can be removed any time after the removal of the object,
and it is a main cause of the statistics or relcache bloat that
motivated this thread. We can know whether entries of this kind
are removable using the cache invalidation mechanism (the patch
upthread).
We can queue the OIDs that identify the entries to remove, then
actually remove them at the next resize. (This also could be
another cause of bloat, so we could forcibly flush a hash when
the OID list becomes longer than some threshold.)
C. An entry for an object that has never existed.
I'm not sure how we should treat this, since the necessity of
such an entry purely depends on whether it will be accessed again
sometime. But we could apply the same assumption as for A.
Now, I also see your point about trying to clean up before
resizing. That does seem like a good idea, although we have to be
careful not to be too eager to clean up there, or we'll just end up
artificially limiting the cache size when it's unwise to do so. But I
guess that's what you meant by "Also, stop reclaiming at that
threshold, to avoid unnecessarily purging cache entries." I think the
idea you are proposing is that:
1. The first time we are due to expand the hash table, we check
whether we can forestall that expansion by doing a cleanup; if so, we
do that instead.
2. After that, we just expand.
That seems like a fairly good idea, although it might be a better idea
to allow cleanup if enough time has passed. If we hit the expansion
threshold twice an hour apart, there's no reason not to try cleanup
again.
A session that intermittently executes queries which run in a very
short time could be considered an example workload where cleanup
under such criteria is unwelcome. But the syscache won't bloat in
that case.
Generally, the way I'm viewing this is that a syscache entry means
paying memory to save CPU time. Each 8kB of memory we use to store
system cache entries is one less block we have for the OS page cache
to hold onto our data blocks. If we had an oracle (the kind from
Sure
Delphi, not Redwood City) that told us with perfect accuracy when to
discard syscache entries, it would throw away syscache entries
Except for B above. The logic seems somewhat alien to
the time-based cleanup, but it can be a measure against quick
bloat of some syscaches.
whenever the marginal execution-time performance we could buy from
another 8kB in the page cache is greater than the marginal
execution-time performance we could buy from those syscache entries.
In reality, it's hard to know which of those things is of greater
value. If the system isn't meaningfully memory-constrained, we ought
to just always hang onto the syscache entries, as we do today, but
it's hard to know that. I think the place where this really becomes a
problem is on systems with hundreds of connections + thousands of
tables + connection pooling; without some back-pressure, every backend
eventually caches everything, putting the system under severe memory
pressure for basically no performance gain. Each new use of the
connection is probably for a limited set of tables, and only those
tables really need syscache entries; holding onto things used long in the
past doesn't save enough to justify the memory used.
Agreed. The following is the whole picture of the measure for
syscache bloat, considering "quick bloat". (I still think it is
wanted in some situations.)
1. On removal of any object that makes some syscache entries
stale (this cannot be checked without scanning a whole hash),
just queue its OID into, for example, a recently_removed_relations
OID hash.
2. If the number of OID-hash entries reaches 1000 or 10000
(mmm, quite arbitrary..), immediately clean up the syscaches that
accept/need removed-reloid cleanup. (The OID hash might be
needed separately for each target cache, to avoid redundant
scans or to get rid of the need for a kind of generation
management in the OID hash. A sketch of the queue in 1 and 2
appears after this plan.)
3.
1. The first time we are due to expand the hash table, we check
whether we can forestall that expansion by doing a cleanup; if so, we
do that instead.
And if there are any entries in the removed-reloid hash, they are
considered during cleanup.
4.
2. After that, we just expand.
That seems like a fairly good idea, although it might be a better idea
to allow cleanup if enough time has passed. If we hit the expansion
threshold twice an hour apart, there's no reason not to try cleanup
again.
1 + 2 and 3 + 4 can be implemented as separate patches and I'll
do the latter first.
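A rough sketch of the queue in steps 1 and 2, purely for illustration
(every name is invented, including CleanupCatCachesByRelids, which
stands for a sweep of the caches keyed by relation OID; the threshold
is the arbitrary figure from point 2):

#define REMOVED_RELOID_THRESHOLD 1000

static Oid	removed_reloids[REMOVED_RELOID_THRESHOLD];
static int	n_removed_reloids = 0;

/* Hypothetical: called when a relation is dropped. */
void
RememberRemovedRelation(Oid relid)
{
	removed_reloids[n_removed_reloids++] = relid;

	/* Step 2: when the queue fills up, sweep the affected caches. */
	if (n_removed_reloids >= REMOVED_RELOID_THRESHOLD)
	{
		CleanupCatCachesByRelids(removed_reloids, n_removed_reloids);
		n_removed_reloids = 0;
	}
}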
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I see three kinds of syscache entries.
A. An entry for an actually existing object.
B. An entry for an object which once existed but no longer does.
C. An entry for an object that has never existed.
I'm not convinced that it's useful to divide things up this way.
Regardless of whether a syscache entry is a positive entry, a
negative entry for a dropped object, or a negative entry for an
object that never existed in the first place, it's valuable if it's
likely to get used again and worthless if not. Positive entries may
get used repeatedly, or not; negative entries may get used repeatedly,
or not.
Generally, the way I'm viewing this is that a syscache entry means
paying memory to save CPU time. Each 8kB of memory we use to store
system cache entries is one less block we have for the OS page cache
to hold onto our data blocks. If we had an oracle (the kind from
Sure
Delphi, not Redwood City) that told us with perfect accuracy when to
discard syscache entries, it would throw away syscache entries
Except for B above. The logic seems somewhat alien to
the time-based cleanup, but it can be a measure against quick
bloat of some syscaches.
I guess I still don't see why B is different. If somebody sits there
and runs queries against non-existent table names at top speed, maybe
they'll query the same non-existent table entries more than once, in
which case keeping the negative entries for the non-existent table
names around until they stop doing it may improve performance. If
they are sitting there and running queries against randomly-generated
non-existent table names at top speed, then they'll generate a lot of
catcache bloat, but that's not really any different from a database
with a large number of tables that DO exist which are queried at
random. Workloads that access a lot of objects, whether those objects
exist or not, are going to use up a lot of cache entries, and I guess
that just seems OK to me.
Agreed. The following is the whole picture of the measure for
syscache bloat, considering "quick bloat". (I still think it is
wanted in some situations.)
1. On removal of any object that makes some syscache entries
stale (this cannot be checked without scanning a whole hash),
just queue its OID into, for example, a recently_removed_relations
OID hash.
If we just let some sort of cleanup process that generally blows away
rarely-used entries get rid of those entries too, then it should
handle this case, too, because the cache entries pertaining to removed
relations (or schemas) probably won't get used after that (and if they
do, we should keep them). So I don't see that there is a need for
this, and it drew objections upthread because of the cost of scanning
the whole hash table. Batching relations together might help, but it
doesn't really seem worth trying to sort out the problems with this
idea when we can do something better and more general.
2. If the number of OID-hash entries reaches 1000 or 10000
(mmm, quite arbitrary..), immediately clean up the syscaches that
accept/need removed-reloid cleanup. (The OID hash might be
needed separately for each target cache, to avoid redundant
scans or to get rid of the need for a kind of generation
management in the OID hash.)
That is bound to draw a strong negative response from Tom, and for
good reason. If the number of relations in the working set is 1001
and your cleanup threshold is 1000, cleanups will happen constantly
and performance will be poor. This is exactly why, as I said in the
second email on this thread, the limit on the size of the relcache
was removed.
1. The first time we are due to expand the hash table, we check
whether we can forestall that expansion by doing a cleanup; if so, we
do that instead.
And if there are any entries in the removed-reloid hash, they are
considered during cleanup.
As I say, I don't think there's any need for a removed-reloid hash.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:I see three kinds of syscache entries.
A. An entry for an actually existing object.
B. An entry for an object which once existed but no longer does.
C. An entry for an object that has never existed.
I'm not convinced that it's useful to divide things up this way.
Actually, I don't believe that case B exists at all; such an entry
should get blown away by syscache invalidation when we commit the
DROP command. If one were to stick around, we'd risk false positive
lookups later.
I guess I still don't see why B is different. If somebody sits there
and runs queries against non-existent table names at top speed, maybe
they'll query the same non-existent table entries more than once, in
which case keeping the negative entries for the non-existent table
names around until they stop doing it may improve performance.
FWIW, my recollection is that the reason for negative cache entries
is that there are some very common patterns where we probe for object
names (not just table names, either) that aren't there, typically as
a side effect of walking through the search_path looking for a match
to an unqualified object name. Those cache entries aren't going to
get any less useful than the positive entry for the ultimately-found
object. So from a lifespan point of view I'm not very sure that it's
worth distinguishing cases A and C.
It's conceivable that we could rewrite all the lookup algorithms
so that they didn't require negative cache entries to have good
performance ... but I doubt that that's easy to do.
regards, tom lane
At Tue, 19 Dec 2017 13:14:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <748.1513707249@sss.pgh.pa.us>
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I see three kinds of syscache entries.
A. An entry for an actually existing object.
B. An entry for an object which once existed but no longer does.
C. An entry for an object that has never existed.
I'm not convinced that it's useful to divide things up this way.
Actually, I don't believe that case B exists at all; such an entry
should get blown away by syscache invalidation when we commit the
DROP command. If one were to stick around, we'd risk false positive
lookups later.
As I have shown upthread, access to a temporary table (*1) leaves
several STATRELATTINH entries after the DROP, and they don't have
a chance to be deleted. SELECTing a nonexistent table in a schema
(*2) also leaves a RELNAMENSP entry after the schema is dropped. I'm
not sure the latter happens so frequently, but the former happens
rather frequently and quickly bloats the syscache once it does.
However, no false positives can happen, since such entries cannot
be reached without their parent objects; but on the other hand they
have no chance to be deleted.
*1: begin; create temp table t1 (a int, b int, c int, d int, e int, f int, g int, h int, i int, j int) on commit drop; insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); select * from t1; commit;
*2: create schema foo; select * from foo.invalid; drop schema foo;
I guess I still don't see why B is different. If somebody sits there
and runs queries against non-existent table names at top speed, maybe
they'll query the same non-existent table entries more than once, in
which case keeping the negative entries for the non-existent table
names around until they stop doing it may improve performance.
FWIW, my recollection is that the reason for negative cache entries
is that there are some very common patterns where we probe for object
names (not just table names, either) that aren't there, typically as
a side effect of walking through the search_path looking for a match
to an unqualified object name. Those cache entries aren't going to
get any less useful than the positive entry for the ultimately-found
object. So from a lifespan point of view I'm not very sure that it's
worth distinguishing cases A and C.
Agreed.
It's conceivable that we could rewrite all the lookup algorithms
so that they didn't require negative cache entries to have good
performance ... but I doubt that that's easy to do.
That sounds to me the same as improving the performance of
systable scans to match the local hash. A lockless systable
(index) might work (if possible)?
Anyway, I think we have reached a consensus that time-tick-based
expiration is promising. So I'll work in that direction as the
first step.
Thanks!
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 22 Dec 2017 13:47:16 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171222.134716.88479707.horiguchi.kyotaro@lab.ntt.co.jp>
Anyway, I think we have reached a consensus that time-tick-based
expiration is promising. So I'll work in that direction as the
first step.
So this is the patch. It has gotten simpler.
# I have come to think that the second step is not needed.
I'm not sure that no syscache access happens outside a statement,
but the operations that lead to the bloat seem to be performed
while processing a statement. So the statement timestamp seems
sufficient as the aging clock.
At first I tried the simple strategy of removing entries that have
been left alone for 30 minutes or more, but I still wanted to
alleviate the quick bloat (from non-reused entries), so I also
introduced a clock-sweep-like aging mechanism. An entry is created
with naccessed = 0, which is then incremented up to 2 each time the
entry is accessed. The removal side decrements naccessed for
entries older than 600 seconds, then actually removes an entry when
it reaches 0. Entries that are just created and never used go away
in 600 seconds, and entries that have been accessed several times
have 1800 seconds' grace after the last access.
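For illustration, the aging rule just described might look like this
sketch, reusing the patch's naccessed and lastaccess fields and its
catcacheclock variable (the function names are invented; the dlist
iteration and TimestampDifference are PostgreSQL primitives):

/* On each cache hit: raise naccessed to at most 2, refresh the stamp. */
static inline void
CatCacheBumpOnAccess(CatCTup *ct)
{
	if (ct->naccessed < 2)
		ct->naccessed++;
	ct->lastaccess = catcacheclock;
}

/* The removal side: decrement old entries, remove them at zero. */
static void
CatCacheSweepOldEntries(CatCache *cp)
{
	for (int i = 0; i < cp->cc_nbuckets; i++)
	{
		dlist_mutable_iter iter;

		dlist_foreach_modify(iter, &cp->cc_bucket[i])
		{
			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
			long		secs;
			int			usecs;

			TimestampDifference(ct->lastaccess, catcacheclock, &secs, &usecs);
			if (secs < 600)
				continue;		/* accessed within the last 600 seconds */
			if (ct->naccessed > 0)
				ct->naccessed--;	/* used entries get extra grace */
			else if (ct->refcount == 0)
				CatCacheRemoveCTup(cp, ct);
		}
	}
}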
We could shrink the bucket array as well, but I didn't, since it is
not so large and is prone to grow back to the same size shortly
anyway.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2017-12-26 18:19:16 +0900, Kyotaro HORIGUCHI wrote:
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,6 +733,9 @@
 void
 SetCurrentStatementStartTimestamp(void)
 {
 	stmtStartTimestamp = GetCurrentTimestamp();
+
+	/* Set this time stamp as aproximated current time */
+	SetCatCacheClock(stmtStartTimestamp);
 }
Hm.
+ * Remove entries that haven't been accessed for a certain time.
+ *
+ * Sometimes catcache entries are left unremoved for several reasons.
I'm unconvinced that that's ok for positive entries, entirely regardless
of this patch.
We
+ * cannot allow them to eat up the usable memory and still it is better to
+ * remove entries that are no longer accessed from the perspective of memory
+ * performance ratio. Unfortunately we cannot predict that but we can assume
+ * that entries that are not accessed for long time no longer contribute to
+ * performance.
+ */
This needs polish.
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+	int			i;
+	int			nremoved = 0;
+#ifdef CATCACHE_STATS
+	int			ntotal = 0;
+	int			tm[] = {30, 60, 600, 1200, 1800, 0};
+	int			cn[6] = {0, 0, 0, 0, 0};
+	int			cage[3] = {0, 0, 0};
+#endif
This doesn't look nice: the names aren't descriptive enough to be self-evident,
and there are no comments saying what these random arrays mean. And some specify
a length (and have a differing number of elements!) and others don't.
+	/* Move all entries from old hash table to new. */
+	for (i = 0; i < cp->cc_nbuckets; i++)
+	{
+		dlist_mutable_iter iter;
+
+		dlist_foreach_modify(iter, &cp->cc_bucket[i])
+		{
+			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+			long		s;
+			int			us;
+
+			TimestampDifference(ct->lastaccess, catcacheclock, &s, &us);
+
+#ifdef CATCACHE_STATS
+			{
+				int			j;
+
+				ntotal++;
+				for (j = 0 ; tm[j] != 0 && s > tm[j] ; j++);
+				if (tm[j] == 0) j--;
+				cn[j]++;
+			}
+#endif
What?
+			/*
+			 * Remove entries older than 600 seconds but not recently used.
+			 * Entries that are not accessed after creation are removed in 600
+			 * seconds, and that has been used several times are removed after
+			 * 30 minumtes ignorance. We don't try shrink buckets since they
+			 * are not the major part of syscache bloat and they are expected
+			 * to be filled shortly again.
+			 */
+			if (s > 600)
+			{
So this is hardcoded, without any sort of cache pressure logic? Doesn't
that mean we'll often *severely* degrade performance if a backend is
idle for a while?
Greetings,
Andres Freund
On Thu, Mar 1, 2018 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote:
So this is hardcoded, without any sort of cache pressure logic? Doesn't
that mean we'll often *severely* degrade performance if a backend is
idle for a while?
Well, it is true that if we flush cache entries that haven't been used
in a long time, a backend that is idle for a long time might be a bit
slow when (and if) it eventually becomes non-idle, because it may have
to reload some of those flushed entries. On the other hand, a backend
that holds onto a large number of cache entries that it's not using
for tens of minutes at a time degrades the performance of the whole
system unless, of course, you're running on a machine that is under no
memory pressure at all. I don't understand why people keep acting as
if holding onto cache entries regardless of how infrequently they're
being used is an unalloyed good.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-03-01 14:24:56 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote:
So this is hardcoded, without any sort of cache pressure logic? Doesn't
that mean we'll often *severely* degrade performance if a backend is
idle for a while?
Well, it is true that if we flush cache entries that haven't been used
in a long time, a backend that is idle for a long time might be a bit
slow when (and if) it eventually becomes non-idle, because it may have
to reload some of those flushed entries.
Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.
On the other hand, a backend that holds onto a large number of cache
entries that it's not using for tens of minutes at a time degrades the
performance of the whole system unless, of course, you're running on a
machine that is under no memory pressure at all.
But it's *extremely* common to have no memory pressure these days. The
inverse definitely also exists.
I don't understand why people keep acting as if holding onto cache
entries regardless of how infrequently they're being used is an
unalloyed good.
Huh? I'm definitely not arguing for that? I think we want a feature like
this, I just don't think the logic for when to prune is quite sophisticated
enough?
Greetings,
Andres Freund
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.
Really? I'm not sure I believe that. You're talking perhaps a few
milliseconds - maybe less - of additional latency on a connection
that's been idle for many minutes. You need to have a workload that
involves leaving connections idle for very long periods but has
extremely tight latency requirements when it does finally send a
query. I suppose such workloads exist, but I would not think them
common.
Anyway, I don't mind making the exact timeout a GUC (with 0 disabling
the feature altogether) if that addresses your concern, but in general
I think that it's reasonable to accept that a connection that's been
idle for a long time may have a little bit more latency than usual
when you start using it again. That could happen for other reasons
anyway -- e.g. the cache could have been flushed because of concurrent
DDL on the objects you were accessing, by a syscache reset caused by a
flood of temp objects being created, or by the operating system
deciding to page out some of your data, or by your data getting
evicted from the CPU caches, or by being scheduled onto a NUMA node
different than the one that contains its data. Operating systems have
been optimizing for the performance of relatively active processes
over ones that have been idle for a long time since the 1960s or
earlier, and I don't know of any reason why PostgreSQL shouldn't do
the same.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.
Really? I'm not sure I believe that. You're talking perhaps a few
milliseconds - maybe less - of additional latency on a connection
that's been idle for many minutes.
I've seen latency increases in second+ ranges due to empty cat/sys/rel
caches. And the connection doesn't have to be idle, it might just have
been active for a different application doing different things, thus
accessing different cache entries. With a pooler you can trivially end
up switch connections occasionally between different [parts of]
applications, and you don't want performance to suck after each time.
You also don't want to use up all memory, I entirely agree on that.
Anyway, I don't mind making the exact timeout a GUC (with 0 disabling
the feature altogether) if that addresses your concern, but in general
I think that it's reasonable to accept that a connection that's been
idle for a long time may have a little bit more latency than usual
when you start using it again.
I don't think that'd quite address my concern. I just don't think that
the granularity (drop all entries older than xxx sec at the next resize)
is right. For one I don't want to drop stuff if the cache size isn't a
problem for the current memory budget. For another, I'm not convinced
that dropping entries from the current "generation" at resize won't end
up throwing away too much.
If we had a GUC 'syscache_memory_target' and we'd only start pruning if
above it, I'd be much happier.
Greetings,
Andres Freund
On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.
Really? I'm not sure I believe that. You're talking perhaps a few
milliseconds - maybe less - of additional latency on a connection
that's been idle for many minutes.
I've seen latency increases in second+ ranges due to empty cat/sys/rel
caches.
How is that even possible unless the system is grossly overloaded?
Anyway, I don't mind making the exact timeout a GUC (with 0 disabling
the feature altogether) if that addresses your concern, but in general
I think that it's reasonable to accept that a connection that's been
idle for a long time may have a little bit more latency than usual
when you start using it again.
I don't think that'd quite address my concern. I just don't think that
the granularity (drop all entries older than xxx sec at the next resize)
is right. For one I don't want to drop stuff if the cache size isn't a
problem for the current memory budget. For another, I'm not convinced
that dropping entries from the current "generation" at resize won't end
up throwing away too much.
I think that a fixed memory budget for the syscache is an idea that
was tried many years ago and basically failed, because it's very easy
to end up with terrible eviction patterns -- e.g. if you are accessing
11 relations in round-robin fashion with a 10-relation cache, your
cache nets you a 0% hit rate but takes a lot more maintenance than
having no cache at all. The time-based approach lets the cache grow
with no fixed upper limit without allowing unused entries to stick
around forever.
If we'd a guc 'syscache_memory_target' and we'd only start pruning if
above it, I'd be much happier.
It does seem reasonable to skip pruning altogether if the cache is
below some threshold size.
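For illustration, such a guard could look roughly like the following sketch, assuming a hypothetical per-cache byte counter (cc_tupsize) and the syscache_memory_target GUC proposed in this thread; none of this is committed code.

/*
 * Sketch only: skip pruning while the cache is within its memory
 * budget.  cc_tupsize is a hypothetical per-cache byte counter and
 * syscache_memory_target a GUC in kilobytes (0 means no minimum),
 * following the names used in this thread.
 */
static bool
CatCacheShouldPrune(CatCache *cp)
{
	if (syscache_memory_target > 0 &&
		cp->cc_tupsize < (Size) syscache_memory_target * 1024)
		return false;		/* under budget: nothing to prune */

	return true;			/* over budget: consider age-based pruning */
}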
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-03-01 15:19:26 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.
Really? I'm not sure I believe that. You're talking perhaps a few
milliseconds - maybe less - of additional latency on a connection
that's been idle for many minutes.
I've seen latency increases in second+ ranges due to empty cat/sys/rel
caches.
How is that even possible unless the system is grossly overloaded?
You just need to have catalog contents out of cache and statements
touching a few relations, functions, etc. Indexscan + heap fetch
latencies do add up quite quickly if done sequentially.
I don't think that'd quite address my concern. I just don't think that
the granularity (drop all entries older than xxx sec at the next resize)
is right. For one I don't want to drop stuff if the cache size isn't a
problem for the current memory budget. For another, I'm not convinced
that dropping entries from the current "generation" at resize won't end
up throwing away too much.
I think that a fixed memory budget for the syscache is an idea that
was tried many years ago and basically failed, because it's very easy
to end up with terrible eviction patterns -- e.g. if you are accessing
11 relations in round-robin fashion with a 10-relation cache, your
cache nets you a 0% hit rate but takes a lot more maintenance than
having no cache at all. The time-based approach lets the cache grow
with no fixed upper limit without allowing unused entries to stick
around forever.
I definitely think we want a time based component to this, I just want
to not prune at all if we're below a certain size.
If we'd a guc 'syscache_memory_target' and we'd only start pruning if
above it, I'd be much happier.
It does seem reasonable to skip pruning altogether if the cache is
below some threshold size.
Cool. There might be some issues making that check performant enough,
but I don't have a good intuition on it.
Greetings,
Andres Freund
Hello.
Thank you for the discussion, and sorry for being late to respond.
At Thu, 1 Mar 2018 12:26:30 -0800, Andres Freund <andres@anarazel.de> wrote in <20180301202630.2s6untij2x5hpksn@alap3.anarazel.de>
Hi,
On 2018-03-01 15:19:26 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.
Really? I'm not sure I believe that. You're talking perhaps a few
milliseconds - maybe less - of additional latency on a connection
that's been idle for many minutes.
I've seen latency increases in second+ ranges due to empty cat/sys/rel
caches.
How is that even possible unless the system is grossly overloaded?
You just need to have catalog contents out of cache and statements
touching a few relations, functions, etc. Indexscan + heap fetch
latencies do add up quite quickly if done sequentially.
I don't think that'd quite address my concern. I just don't think that
the granularity (drop all entries older than xxx sec at the next resize)
is right. For one I don't want to drop stuff if the cache size isn't a
problem for the current memory budget. For another, I'm not convinced
that dropping entries from the current "generation" at resize won't end
up throwing away too much.
I think that a fixed memory budget for the syscache is an idea that
was tried many years ago and basically failed, because it's very easy
to end up with terrible eviction patterns -- e.g. if you are accessing
11 relations in round-robin fashion with a 10-relation cache, your
cache nets you a 0% hit rate but takes a lot more maintenance than
having no cache at all. The time-based approach lets the cache grow
with no fixed upper limit without allowing unused entries to stick
around forever.
I definitely think we want a time based component to this, I just want
to not prune at all if we're below a certain size.
If we'd a guc 'syscache_memory_target' and we'd only start pruning if
above it, I'd be much happier.
It does seem reasonable to skip pruning altogether if the cache is
below some threshold size.
Cool. There might be some issues making that check performant enough,
but I don't have a good intuition on it.
So..
- Now it gets two new GUC variables named syscache_prune_min_age
and syscache_memory_target. The former is the replacement for
the previous magic number 600 and defaults to the same
number. The latter prevents syscache pruning until the cache
exceeds that size, and defaults to 0, meaning that pruning is
always considered. Documentation for the two variables is also
added.
- Revised the comment for CatcacheCleanupOldEntries that was
pointed out as mysterious, and added some comments.
- Renamed the variables for CATCACHE_STATS to be more
descriptive, and added some comments to the code.
Catcache entries accessed within the current transaction
won't be pruned, so theoretically a long transaction can bloat
the catcache. But I believe that case is quite rare, or at least
this covers most other cases.
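To make the mechanism concrete, a minimal sketch of such a pruning pass, as it might sit in catcache.c, is below. It is not the patch's actual code: lastaccess and accessed_in_xact are assumed CatCTup fields standing in for the patch's bookkeeping, while the dlist iteration, TimestampDifferenceExceeds() and CatCacheRemoveCTup() are existing PostgreSQL idioms.

/*
 * Sketch only, not the patch itself: scan one catcache and drop
 * tuples whose last access is older than syscache_prune_min_age
 * seconds.  Entries pinned or touched in the current transaction are
 * kept; CatCList membership is ignored for brevity.
 */
static int
CatCacheCleanupOldEntries(CatCache *cp)
{
	TimestampTz now = GetCurrentTimestamp();
	int			nremoved = 0;
	int			i;

	for (i = 0; i < cp->cc_nbuckets; i++)
	{
		dlist_mutable_iter iter;

		dlist_foreach_modify(iter, &cp->cc_bucket[i])
		{
			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

			if (ct->refcount > 0 || ct->accessed_in_xact)
				continue;		/* in use: never prune */

			if (TimestampDifferenceExceeds(ct->lastaccess, now,
										   syscache_prune_min_age * 1000))
			{
				CatCacheRemoveCTup(cp, ct);
				nremoved++;
			}
		}
	}
	return nremoved;
}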
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Oops! The previous patch contained garbage in the debugging
output.
The attached is the new version without the garbage. In addition,
I changed my mind and use DEBUG1 for the debug message, since
the frequency is quite low.
There are no changes from the previous mail, cited below.
At Wed, 07 Mar 2018 16:19:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180307.161923.178158050.horiguchi.kyotaro@lab.ntt.co.jp>
Attachments:
The thing that comes to mind when reading this patch is that some time
ago we made fun of other database software, "they are so complicated to
configure, they have some magical settings that few people understand
how to set". Postgres was so much better because it was simple to set
up, no magic crap. But now it becomes apparent that that only was so
because Postgres sucked, ie., we hadn't yet gotten to the point where we
*needed* to introduce settings like that. Now we finally are?
I have to admit being a little disappointed about that outcome.
I wonder if this is just because we refuse to acknowledge the notion of
a connection pooler. If we did, and the pooler told us "here, this
session is being given back to us by the application, we'll keep it
around until the next app comes along", could we clean the oldest
inactive cache entries at that point? Currently they use DISCARD for
that. Though this does nothing to fix hypothetical cache bloat for
pg_dump in bug #14936.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-03-07 08:01:38 -0300, Alvaro Herrera wrote:
I wonder if this is just because we refuse to acknowledge the notion of
a connection pooler. If we did, and the pooler told us "here, this
session is being given back to us by the application, we'll keep it
around until the next app comes along", could we clean the oldest
inactive cache entries at that point? Currently they use DISCARD for
that. Though this does nothing to fix hypothetical cache bloat for
pg_dump in bug #14936.
I'm not seeing how this solves anything? You don't want to throw all
caches away, therefore you need a target size. Then there's also the
case of the cache being too large in a single "session".
Greetings,
Andres Freund
Hello,
Andres Freund wrote:
On 2018-03-07 08:01:38 -0300, Alvaro Herrera wrote:
I wonder if this is just because we refuse to acknowledge the notion of
a connection pooler. If we did, and the pooler told us "here, this
session is being given back to us by the application, we'll keep it
around until the next app comes along", could we clean the oldest
inactive cache entries at that point? Currently they use DISCARD for
that. Though this does nothing to fix hypothetical cache bloat for
pg_dump in bug #14936.
I'm not seeing how this solves anything? You don't want to throw all
caches away, therefore you need a target size. Then there's also the
case of the cache being too large in a single "session".
Oh, I wasn't suggesting to throw away the whole cache at that point;
only that that is a convenient point to do whatever cleanup we want to do.
What I'm not clear about is exactly what is the cleanup that we want to
do at that point. You say it should be based on some configured size;
Robert says any predefined size breaks [performance for] the case where
the workload uses size+1, so let's use time instead (evict anything not
used in more than X seconds?), but keeping in mind that a workload that
requires X+1 would also break. So it seems we've arrived at the
conclusion that the only possible solution is to let the user tell us
what time/size to use. But that sucks, because the user doesn't know
either (maybe they can measure, but how?), and they don't even know that
this setting is there to be tweaked; and if there is a performance
problem, how do they figure whether or not it can be fixed by fooling
with this parameter? I mean, maybe it's set to 10 and we suggest "maybe
11 works better" but it turns out not to, so "maybe 12 works better"?
How do you know when to stop increasing it?
This seems a bit like max_fsm_pages, that is to say, a disaster that was
only fixed by removing it.
Thanks,
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-03-07 14:48:48 -0300, Alvaro Herrera wrote:
Oh, I wasn't suggesting to throw away the whole cache at that point;
only that that is a convenient point to do whatever cleanup we want to do.
But why is that better than doing so continuously?
What I'm not clear about is exactly what is the cleanup that we want to
do at that point. You say it should be based on some configured size;
Robert says any predefined size breaks [performance for] the case where
the workload uses size+1, so let's use time instead (evict anything not
used in more than X seconds?), but keeping in mind that a workload that
requires X+1 would also break.
We mostly seem to have found that adding a *minimum* size before
starting to evict based on time solves both of our concerns?
So it seems we've arrived at the
conclusion that the only possible solution is to let the user tell us
what time/size to use. But that sucks, because the user doesn't know
either (maybe they can measure, but how?), and they don't even know that
this setting is there to be tweaked; and if there is a performance
problem, how do they figure whether or not it can be fixed by fooling
with this parameter? I mean, maybe it's set to 10 and we suggest "maybe
11 works better" but it turns out not to, so "maybe 12 works better"?
How do you know when to stop increasing it?
I don't think it's that complicated, for the size figure. Having a knob
that controls how much memory a backend uses isn't a new concept, and
can definitely depend on the use case.
This seems a bit like max_fsm_pages, that is to say, a disaster that was
only fixed by removing it.
I don't think that's a meaningful comparison. max_fsm_pages had a
persistent effect, couldn't be tuned without restarts, and the
performance dropoffs were much more "cliff"-like.
Greetings,
Andres Freund
On Wed, Mar 7, 2018 at 6:01 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
The thing that comes to mind when reading this patch is that some time
ago we made fun of other database software, "they are so complicated to
configure, they have some magical settings that few people understand
how to set". Postgres was so much better because it was simple to set
up, no magic crap. But now it becomes apparent that that only was so
because Postgres sucked, i.e., we hadn't yet gotten to the point where we
*needed* to introduce settings like that. Now we finally are?
I have to admit being a little disappointed about that outcome.
I think your disappointment is a little excessive. I am not convinced
of the need either for this to have any GUCs at all, but if it makes
other people happy to have them, then I think it's worth accepting
that as the price of getting the feature into the tree. These are
scarcely the first GUCs we have that are hard to tune. work_mem is a
terrible knob, and there are probably very few people who know
how to set ssl_ecdh_curve to anything other than the default, and
what's geqo_selection_bias good for, anyway? I'm not sure what makes
the settings we're adding here any different. Most people will ignore
them, and a few people who really care can change the values.
I wonder if this is just because we refuse to acknowledge the notion of
a connection pooler. If we did, and the pooler told us "here, this
session is being given back to us by the application, we'll keep it
around until the next app comes along", could we clean the oldest
inactive cache entries at that point? Currently they use DISCARD for
that. Though this does nothing to fix hypothetical cache bloat for
pg_dump in bug #14936.
We could certainly clean the oldest inactive cache entries at that
point, but there's no guarantee that would be the right thing to do.
If the working set across all applications is small enough that you
can keep them all in the caches all the time, then you should do that,
for maximum performance. If not, DISCARD ALL should probably flush
everything that the last application needed and the next application
won't. But without some configuration knob, you have zero way of
knowing how concerned the user is about saving memory in this place
vs. improving performance by reducing catalog scans. Even with such a
knob it's a little difficult to say which things actually ought to be
thrown away.
I think this is a related problem, but a different one. I also think
we ought to have built-in connection pooling. :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org]
The thing that comes to mind when reading this patch is that some time ago
we made fun of other database software, "they are so complicated to configure,
they have some magical settings that few people understand how to set".
Postgres was so much better because it was simple to set up, no magic crap.
But now it becomes apparent that that only was so because Postgres sucked,
i.e., we hadn't yet gotten to the point where we
*needed* to introduce settings like that. Now we finally are?
Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. Based on a small experiment, it is estimated that each backend will use several to ten GB of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum number of connections is used.
I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache, plancache, etc?
Regards
Takayuki Tsunakawa
Hello,
At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org]
The thing that comes to mind when reading this patch is that some time ago
we made fun of other database software, "they are so complicated to configure,
they have some magical settings that few people understand how to set".
Postgres was so much better because it was simple to set up, no magic crap.
But now it becomes apparent that that only was so because Postgres sucked,
i.e., we hadn't yet gotten to the point where we
*needed* to introduce settings like that. Now we finally are?
Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. Based on a small experiment, it is estimated that each backend will use several to ten GB of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum number of connections is used.
I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache, plancache, etc?
This works only for syscaches, which could bloat with entries for
nonexistent objects.
Plan cache is an utterly different thing. It is abandoned at the
end of a transaction or the like.
Relcache is not based on catcache and is out of the scope of this
patch, since it doesn't get bloated with nonexistent entries. It
uses dynahash, and we could introduce a similar feature to it if
we are willing to cap relcache size.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. Based on a small experiment, it is estimated that each backend will use several to ten GB of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum number of connections is used.
I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache, plancache, etc?
This works only for syscaches, which could bloat with entries for
nonexistent objects.
Plan cache is an utterly different thing. It is
abandoned at the end of a transaction or the like.
When I was at Salesforce, we had *substantial* problems with plancache
bloat. The driving factor there was plans associated with plpgsql
functions, which Salesforce had a huge number of. In an environment
like that, there would be substantial value in being able to prune
both the plancache and plpgsql's function cache. (Note that neither
of those things are "abandoned at the end of a transaction".)
Relcache is not based on catcache and is out of the scope of this
patch, since it doesn't get bloated with nonexistent entries. It
uses dynahash, and we could introduce a similar feature to it if
we are willing to cap relcache size.
I think if the case of concern is an application with 200,000 tables,
it's just nonsense to claim that relcache size isn't an issue.
In short, it's not really apparent to me that negative syscache entries
are the major problem of this kind. I'm afraid that you're drawing very
large conclusions from a specific workload. Maybe we could fix that
workload some other way.
regards, tom lane
At Wed, 07 Mar 2018 23:12:29 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <352.1520482349@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. Based on a small experiment, it is estimated that each backend will use several to ten GB of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum number of connections is used.
I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache, plancache, etc?
This works only for syscaches, which could bloat with entries for
nonexistent objects.
Plan cache is an utterly different thing. It is abandoned at the
end of a transaction or the like.
When I was at Salesforce, we had *substantial* problems with plancache
bloat. The driving factor there was plans associated with plpgsql
functions, which Salesforce had a huge number of. In an environment
like that, there would be substantial value in being able to prune
both the plancache and plpgsql's function cache. (Note that neither
of those things are "abandoned at the end of a transaction".)
Mmm. Right. Thanks for pointing that out. Anyway, plan cache seems to be
a different thing.
Relcache is not based on catcache and is out of the scope of this
patch, since it doesn't get bloated with nonexistent entries. It
uses dynahash, and we could introduce a similar feature to it if
we are willing to cap relcache size.
I think if the case of concern is an application with 200,000 tables,
it's just nonsense to claim that relcache size isn't an issue.
In short, it's not really apparent to me that negative syscache entries
are the major problem of this kind. I'm afraid that you're drawing very
large conclusions from a specific workload. Maybe we could fix that
workload some other way.
The current patch doesn't consider whether an entry is negative
or positive(?). It just cleans up all entries based on time.
If relcache has to have the same characteristics as syscaches, it
might be better to build it on the catcache mechanism, instead of
adding the same pruning mechanism to dynahash.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 09 Mar 2018 17:40:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180309.174001.202113825.horiguchi.kyotaro@lab.ntt.co.jp>
In short, it's not really apparent to me that negative syscache entries
are the major problem of this kind. I'm afraid that you're drawing very
large conclusions from a specific workload. Maybe we could fix that
workload some other way.
The current patch doesn't consider whether an entry is negative
or positive(?). It just cleans up all entries based on time.
If relcache has to have the same characteristics as syscaches, it
might be better to build it on the catcache mechanism, instead of
adding the same pruning mechanism to dynahash.
For the moment, I added such a feature to dynahash and let only
relcache use it in this patch. Hash elements have a different shape
in a "prunable" hash, and pruning is performed in a similar way,
sharing the settings with syscache. This seems to work fine.
It is a bit uneasy that all syscaches and the relcache share the
same value of syscache_memory_target...
Something like the attached test script causes relcache
"bloat". The server emits log entries like the following at
DEBUG1 message level.
DEBUG: removed 11240/32769 entries from hash "Relcache by OID" at character 15
# The last few words are just garbage I mentioned in another thread.
The last two patches do that (as PoC).
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Oops.
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
Something like the attached test script causes relcache "bloat".
Here it is.
Attachments:
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
In short, it's not really apparent to me that negative syscache entries
are the major problem of this kind. I'm afraid that you're drawing very
large conclusions from a specific workload. Maybe we could fix that
workload some other way.
The current patch doesn't consider whether an entry is negative
or positive(?). It just cleans up all entries based on time.
If relcache has to have the same characteristics as syscaches, it
might be better to build it on the catcache mechanism, instead of
adding the same pruning mechanism to dynahash.
For the moment, I added such a feature to dynahash and let only
relcache use it in this patch. Hash elements have a different shape
in a "prunable" hash, and pruning is performed in a similar way,
sharing the settings with syscache. This seems to work fine.
I gave some consideration to plancache. Its biggest difference
in characteristics from catcache and relcache is the fact that it is
not voluntarily removable, since CachedPlanSource, the root struct
of a plan cache entry, holds some indispensable information. As regards
prepared queries, even if we store the information in
another location, for example in the "Prepared Queries" hash, that
merely moves big data to another place.
Looking into CachedPlanSource, the generic plan is a part that is
safely removable, since it is rebuilt as necessary. Keeping "old"
plancache entries without a generic plan can reduce memory
usage.
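A sketch of that idea, as it might sit in plancache.c, is below. saved_plan_list, CachedPlanSource->gplan and the static helper ReleaseGenericPlan() are existing plancache.c internals; last_use, num_saved_plans, min_cached_plans and plancache_prune_min_age are assumptions taken from this mail rather than committed code.

/*
 * Sketch only: walk the saved plans and drop generic plans that have
 * not been used for plancache_prune_min_age seconds, once more than
 * min_cached_plans entries exist.  The parse tree and query string
 * stay; only the generic plan is dropped, since it can be rebuilt.
 */
static void
PruneOldGenericPlans(void)
{
	TimestampTz now = GetCurrentTimestamp();
	dlist_iter	iter;

	if (num_saved_plans <= min_cached_plans)
		return;					/* small enough: keep everything */

	dlist_foreach(iter, &saved_plan_list)
	{
		CachedPlanSource *plansource =
			dlist_container(CachedPlanSource, node, iter.cur);

		if (plansource->gplan &&
			TimestampDifferenceExceeds(plansource->last_use, now,
									   plancache_prune_min_age * 1000))
			ReleaseGenericPlan(plansource);
	}
}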
For testing purposes, I made 50000 prepared statements like
"select sum(c) from p where e < $" on 100 partitions.
With the feature disabled (0004 patch), VSZ of the backend
exceeds 3GB (and was still increasing at that point), while it
stops increasing at about 997MB for min_cached_plans = 1000 and
plancache_prune_min_age = '10s'.
# 10s is apparently too short for actual use, of course.
The saving is expected to be a significant amount if the plans are
large enough, but I'm still not sure it is worth doing, or is the
right way.
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in >
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
It looks like this should be marked Needs Review so I have done so. If
that's not right please change it back or let me know and I will.
Regards,
--
-David
david@pgmasters.net
Hello.
At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in <43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in >
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
It looks like this should be marked Needs Review so I have done so. If
that's not right please change it back or let me know and I will.
Mmm. I hadn't noticed that. Thanks!
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2018-03-23 17:01:11 +0900, Kyotaro HORIGUCHI wrote:
Hello.
At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in <43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in >
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
It looks like this should be marked Needs Review so I have done so. If
that's not right please change it back or let me know and I will.
Mmm. I hadn't noticed that. Thanks!
I actually think this should be marked as returned with feedback, or at
the very least moved to the next CF. This is entirely new development
within the last CF. There's no realistic way we can get this into v11.
Greetings,
Andres Freund
At Thu, 29 Mar 2018 18:22:59 -0700, Andres Freund <andres@anarazel.de> wrote in <20180330012259.7k3442yz7jighg2t@alap3.anarazel.de>
On 2018-03-23 17:01:11 +0900, Kyotaro HORIGUCHI wrote:
Hello.
At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in <43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in >
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
It looks like this should be marked Needs Review so I have done so. If
that's not right please change it back or let me know and I will.
Mmm. I hadn't noticed that. Thanks!
I actually think this should be marked as returned with feedback, or at
the very least moved to the next CF. This is entirely new development
within the last CF. There's no realistic way we can get this into v11.
0002-0004 are new, in response to the comment that caches other
than the catcache ought to get the same feature. These can be
developed separately from 0001, for v12. I don't find a measure
to catch all the cases at once.
If we agree on that point, I wish to discuss only 0001 for v11.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2018-03-30 10:35:48 +0900, Kyotaro HORIGUCHI wrote:
0002-0004 are new, in response to the comment that caches other
than the catcache ought to get the same feature. These can be
developed separately from 0001, for v12. I don't find a measure
to catch all the cases at once.
If we agree on that point, I wish to discuss only 0001 for v11.
I'd personally not want to commit a solution for catcaches without also
committing a solution for at least relcaches in the same release cycle. I
think this patch simply has missed the window for v11.
Greetings,
Andres Freund
At Thu, 29 Mar 2018 18:51:45 -0700, Andres Freund <andres@anarazel.de> wrote in <20180330015145.pvsr6kjtf6tw4uwe@alap3.anarazel.de>
Hi,
On 2018-03-30 10:35:48 +0900, Kyotaro HORIGUCHI wrote:
0002-0004 are new, in response to the comment that caches other
than the catcache ought to get the same feature. These can be
developed separately from 0001, for v12. I don't find a measure
to catch all the cases at once.
If we agree on that point, I wish to discuss only 0001 for v11.
I'd personally not want to commit a solution for catcaches without also
committing a solution for at least relcaches in the same release cycle. I
think this patch simply has missed the window for v11.
Ok. Agreed. I moved this to the next CF.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. I rebased this patchset.
At Thu, 15 Mar 2018 14:12:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180315.141246.130742928.horiguchi.kyotaro@lab.ntt.co.jp>
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
In short, it's not really apparent to me that negative syscache entries
are the major problem of this kind. I'm afraid that you're drawing very
large conclusions from a specific workload. Maybe we could fix that
workload some other way.
The current patch doesn't consider whether an entry is negative
or positive(?). It just cleans up all entries based on time.
If relcache has to have the same characteristics as syscaches, it
might be better to build it on the catcache mechanism, instead of
adding the same pruning mechanism to dynahash.
This means unifying catcache and dynahash. It doesn't seem a
win-win consolidation. In addition, relcache links palloc'ed
memory, which needs additional treatment.
Or we could abstract the pruning mechanism so it is applicable to
both machineries, specifically by unifying CatCacheCleanupOldEntries
in 0001 and prune_entries in 0002. Or we could refactor dynahash and
rebuild catcache based on dynahash.
For the moment, I added such a feature to dynahash and let only
relcache use it in this patch. Hash elements have a different shape
in a "prunable" hash, and pruning is performed in a similar way,
sharing the settings with syscache. This seems to work fine.
I gave some consideration to plancache. Its biggest difference
in characteristics from catcache and relcache is the fact that it is
not voluntarily removable, since CachedPlanSource, the root struct
of a plan cache entry, holds some indispensable information. As regards
prepared queries, even if we store the information in
another location, for example in the "Prepared Queries" hash, that
merely moves big data to another place.
Looking into CachedPlanSource, the generic plan is a part that is
safely removable, since it is rebuilt as necessary. Keeping "old"
plancache entries without a generic plan can reduce memory
usage.
For testing purposes, I made 50000 prepared statements like
"select sum(c) from p where e < $" on 100 partitions.
With the feature disabled (0004 patch), VSZ of the backend
exceeds 3GB (and was still increasing at that point), while it
stops increasing at about 997MB for min_cached_plans = 1000 and
plancache_prune_min_age = '10s'.
# 10s is apparently too short for actual use, of course.
The saving is expected to be a significant amount if the plans are
large enough, but I'm still not sure it is worth doing, or is the
right way.
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
I found up to v3 in this thread so I named this version 4.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 06/26/2018 05:00 AM, Kyotaro HORIGUCHI wrote:
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
I found up to v3 in this thread so I named this version 4.
Andres suggested back in March (and again privately to me) that given
how much this has changed from the original this CF item should be
marked Returned With Feedback and the current patchset submitted as a
new item.
Does anyone object to that course of action?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello. The previous v4 patchset was just broken.
At Tue, 26 Jun 2018 18:00:03 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180626.180003.127457941.horiguchi.kyotaro@lab.ntt.co.jp>
Hello. I rebased this patchset.
..
The attached is the patch set including this plancache stuff.
0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of
plancache pruning. Details are shown above.
I found up to v3 in this thread so I named this version 4.
Somehow the 0004 was merged into the 0003, and applying 0004
results in failure. I removed the 0004 part from 0003, rebased,
and repost it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2018-Jul-02, Andrew Dunstan wrote:
Andres suggested back in March (and again privately to me) that given how
much this has changed from the original this CF item should be marked
Returned With Feedback and the current patchset submitted as a new item.
Does anyone object to that course of action?
If doing that makes the "CF count" reset back to one for the new
submission, then I object to that course of action. If we really
think this item does not belong in this commitfest, let's punt it to
the next one. However, it seems rather strange to do so this early in
the cycle. Is there really no small item that could be cherry-picked
from this series to be committed standalone?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-07-02 21:50:36 -0400, Alvaro Herrera wrote:
On 2018-Jul-02, Andrew Dunstan wrote:
Andres suggested back in March (and again privately to me) that given how
much this has changed from the original this CF item should be marked
Returned With Feedback and the current patchset submitted as a new item.
Does anyone object to that course of action?
If doing that makes the "CF count" reset back to one for the new
submission, then I object to that course of action. If we really
think this item does not belong in this commitfest, let's punt it to
the next one. However, it seems rather strange to do so this early in
the cycle. Is there really no small item that could be cherry-picked
from this series to be committed standalone?
Well, I think it should just have been RWFed last cycle. It got plenty
of feedback. So it doesn't seem that strange to me, not to include it in
the "mop-up" CF? Either way, I don't feel strongly about it, I just know
that I won't have energy for the topic in this CF.
Greetings,
Andres Freund
Hi,
Subject: Re: Protect syscache from bloating with negative cache entries
Hello. The previous v4 patchset was just broken.
Somehow the 0004 was merged into the 0003 and applying 0004 results in failure. I
removed the 0004 part from 0003, rebased, and repost it.
I have some questions about syscache and relcache pruning,
though they may have been discussed upthread or may be beside the point.
Can I confirm about catcache pruning?
syscache_memory_target is the max figure per CatCache.
(Any CatCache has the same max value.)
So the total max size of catalog caches is estimated around or
slightly more than the size of the SysCache array times syscache_memory_target.
If correct, I'm thinking that writing down the above estimation in the
documentation would help DB administrators estimate memory usage.
The current description might lead to the misunderstanding that
syscache_memory_target is the total size of the catalog caches, in my impression.
Related to the above, I just thought changing syscache_memory_target
per CatCache would make memory usage more efficient.
Though I haven't checked if there's a case where each system catalog
cache's memory usage varies largely, the pg_class cache might need more
memory than others and others might need less.
But it would be difficult for users to check each CatCache's memory
usage and tune it, because right now postgresql hasn't provided a handy
way to check them.
Another option is that users only specify the total memory target size,
and postgres dynamically changes each CatCache's memory target size
according to a certain metric.
(That still seems difficult and expensive to develop per benefit.)
What do you think about this?
+	/*
+	 * Set up pruning.
+	 *
+	 * We have two knobs to control pruning, and a hash can share them
+	 * with syscache.
+	 */
+	if (flags & HASH_PRUNABLE)
+	{
+		hctl->prunable = true;
+		hctl->prune_cb = info->prune_cb;
+		if (info->memory_target)
+			hctl->memory_target = info->memory_target;
+		else
+			hctl->memory_target = &cache_memory_target;
+		if (info->prune_min_age)
+			hctl->prune_min_age = info->prune_min_age;
+		else
+			hctl->prune_min_age = &cache_prune_min_age;
+	}
+	else
+		hctl->prunable = false;
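To see how a caller would use that, here is a sketch of creating the relcache OID hash as a "prunable" hash under this patchset. HASH_PRUNABLE and the extra HASHCTL fields exist only in the patch; RelIdCacheEnt, RelationIdCache and INITRELCACHESIZE are the real relcache.c identifiers, and the callback name is made up.

/* Sketch against the quoted patch, not stock PostgreSQL. */
HASHCTL		ctl;

MemSet(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(RelIdCacheEnt);
ctl.prune_cb = RelCachePruneCallback;	/* hypothetical callback */
ctl.memory_target = NULL;		/* NULL: fall back to the shared GUC */
ctl.prune_min_age = NULL;		/* likewise */

RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
							  &ctl,
							  HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE);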
As you commented here, the GUC variables syscache_memory_target and
syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
Do syscache and relcache have similar amounts of memory usage?
If not, I'm thinking that introducing separate GUC variables would be fine,
for syscache_memory_target as well as syscache_prune_min_age.
Regards,
====================
Takeshi Ideriha
Fujitsu Limited
Hello. Thank you for looking at this.
At Wed, 12 Sep 2018 05:16:52 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F197012@G01JPEXMBKW04>
Hi,
Subject: Re: Protect syscache from bloating with negative cache entries
Hello. The previous v4 patchset was just broken.
Somehow the 0004 was merged into the 0003, and applying 0004 results in failure. I
removed the 0004 part from 0003, rebased, and repost it.
I have some questions about syscache and relcache pruning,
though they may have been discussed upthread or may be beside the point.
Can I confirm about catcache pruning?
syscache_memory_target is the max figure per CatCache.
(Any CatCache has the same max value.)
So the total max size of catalog caches is estimated around or
slightly more than the size of the SysCache array times syscache_memory_target.
Right.
If correct, I'm thinking that writing down the above estimation in the
documentation would help DB administrators estimate memory usage.
The current description might lead to the misunderstanding that
syscache_memory_target is the total size of the catalog caches, in my impression.
Honestly, I'm not sure that is the right design. However, I don't
think providing such a formula to users helps them, since they
don't know exactly how many CatCaches and brothers live in their
server, it is a soft limit, and finally only a few or just one
catalog caches can reach the limit.
The current design is based on the assumption that we would have
only one extremely-growable cache in one use case.
Related to the above, I just thought changing syscache_memory_target
per CatCache would make memory usage more efficient.
We could easily have per-cache settings in CatCache, but how do
we provide the knobs for them? I can come up only with excessive
solutions for that.
Though I haven't checked if there's a case where each system catalog cache's memory usage varies largely,
the pg_class cache might need more memory than others and others might need less.
But it would be difficult for users to check each CatCache's memory usage and tune it,
because right now postgresql hasn't provided a handy way to check them.
I supposed that this would be used without such a means: someone
suffering syscache bloat can just set this GUC to avoid the
bloat. End of story.
Apart from that, in the current patch syscache_memory_target is
not exact at all in the first place, in order to avoid the overhead
of counting the correct size. The major difference comes from the
size of the cache tuples themselves. But I came to think that is
too much to omit.
As a *PoC*, in the attached patch (which applies to current
master), the sizes of CTups are counted as the catcache size.
It also provides the pg_catcache_size system view just to give a
rough idea of how such a view looks. I'll consider it more, but
do you have any opinion on this?
=# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size desc;
relid | indid | size
-------------------------+-------------------------------------------+--------
pg_class | pg_class_oid_index | 131072
pg_class | pg_class_relname_nsp_index | 131072
pg_cast | pg_cast_source_target_index | 5504
pg_operator | pg_operator_oprname_l_r_n_index | 4096
pg_statistic | pg_statistic_relid_att_inh_index | 2048
pg_proc | pg_proc_proname_args_nsp_index | 2048
..
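On the accounting side, the PoC's per-tuple counting presumably amounts to a pair of hooks like the following sketch; cc_tupsize and ct->size are hypothetical fields, while GetMemoryChunkSpace() is the stock way to get a chunk's allocated size, and a CatCTup and its copied tuple are palloc'd as one chunk.

/*
 * Sketch of per-tuple size accounting: charge a tuple's allocated
 * size to its cache on creation and credit it back on removal.
 * cc_tupsize and ct->size do not exist in stock PostgreSQL.
 */
static inline void
CatCacheAccountNewTuple(CatCache *cp, CatCTup *ct)
{
	ct->size = GetMemoryChunkSpace(ct);	/* header plus copied tuple */
	cp->cc_tupsize += ct->size;
}

static inline void
CatCacheAccountRemovedTuple(CatCache *cp, CatCTup *ct)
{
	cp->cc_tupsize -= ct->size;
}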
Another option is that users only specify the total memory target size,
and postgres dynamically changes each CatCache's memory target size according to a certain metric.
(That still seems difficult and expensive to develop per benefit.)
What do you think about this?
Given that few caches bloat at once, its effect is not so
different from the current design.
As you commented here, the GUC variables syscache_memory_target and
syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
Right; that is just to avoid adding knobs for unclear reasons. Since ...
Do syscache and relcache have similar amounts of memory usage?
They may be different, but the difference would not matter so
much in the case of cache bloat.
If not, I'm thinking that introducing separate GUC variables would be fine,
for syscache_memory_target as well as syscache_prune_min_age.
I implemented it so that it is easily replaceable in case of need, but
I'm not sure separating them makes a significant difference.
Thanks for the opinions. I'll give this more consideration.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi, thank you for the explanation.
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Can I confirm about catcache pruning?
syscache_memory_target is the max figure per CatCache.
(Any CatCache has the same max value.) So the total max size of
catalog caches is estimated around or slightly more than the size of
the SysCache array times syscache_memory_target.
Right.
If correct, I'm thinking that writing down the above estimation in the
documentation would help DB administrators estimate memory usage.
The current description might lead to the misunderstanding that
syscache_memory_target is the total size of the catalog caches, in my impression.
Honestly, I'm not sure that is the right design. However, I don't think providing such a
formula to users helps them, since they don't know exactly how many CatCaches and
brothers live in their server, it is a soft limit, and finally only a few or just one catalog
caches can reach the limit.
Yeah, I agree that that kind of formula is not suited for the documentation.
But if users don't know how many catcaches and brothers are used in postgres,
then how about making syscache_memory_target the total soft limit of the catcaches,
rather than a size limit on each individual catcache? Internally, syscache_memory_target can
be divided by the number of syscaches and do its work. The total amount would be
easier to understand for users who don't know the detailed contents of the catalog caches.
Or, if users can tell how many and what kinds of catcaches exist, for instance by using
the system view you provided in the previous email, the current design looks good to me.
The current design is based on the assumption that we would have only one
extremely-growable cache in one use case.
Related to the above, I just thought changing syscache_memory_target
per CatCache would make memory usage more efficient.
We could easily have per-cache settings in CatCache, but how do we provide the knobs
for them? I can come up only with excessive solutions for that.
Agreed.
Though I haven't checked if there's a case where each system catalog
cache's memory usage varies largely, the pg_class cache might need more memory than
others and others might need less.
But it would be difficult for users to check each CatCache's memory
usage and tune it, because right now postgresql hasn't provided a handy way to
check them.
I supposed that this would be used without such a means: someone suffering syscache bloat
can just set this GUC to avoid the bloat. End of story.
Yeah, I got the purpose wrong.
Apart from that, in the current patch syscache_memory_target is not exact at all in
the first place, in order to avoid the overhead of counting the correct size. The major
difference comes from the size of the cache tuples themselves. But I came to think
that is too much to omit.
As a *PoC*, in the attached patch (which applies to current master), the sizes of CTups are
counted as the catcache size.
It also provides the pg_catcache_size system view just to give a rough idea of how such a
view looks. I'll consider it more, but do you have any opinion on this?
=# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size desc;
relid | indid | size
-------------------------+-------------------------------------------+--------
pg_class | pg_class_oid_index | 131072
pg_class | pg_class_relname_nsp_index | 131072
pg_cast | pg_cast_source_target_index | 5504
pg_operator | pg_operator_oprname_l_r_n_index | 4096
pg_statistic | pg_statistic_relid_att_inh_index | 2048
pg_proc | pg_proc_proname_args_nsp_index | 2048
..
Great! I like this view.
One extreme idea would be adding all the members printed by CatCachePrintStats(),
which is only enabled with -DCATCACHE_STATS at this moment.
All of the members seem too much for customers who try to change the cache limit size,
but some of the members may be useful; for example, cc_hits would indicate that the current
cache limit size is too small.
Another option is that users only specify the total memory target size,
and postgres dynamically changes each CatCache's memory target size
according to a certain metric.
(That still seems difficult and expensive to develop per benefit.)
What do you think about this?
Given that few caches bloat at once, its effect is not so different from the current
design.
Yes, agreed.
As you commented here, the GUC variables syscache_memory_target and
syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
Right; that is just to avoid adding knobs for unclear reasons. Since ...
Do syscache and relcache have similar amounts of memory usage?
They may be different, but the difference would not matter so much in the case of cache bloat.
If not, I'm thinking that introducing separate GUC variables would be fine,
for syscache_memory_target as well as syscache_prune_min_age.
I implemented it so that it is easily replaceable in case of need, but I'm not sure separating
them makes a significant difference.
Maybe I was overthinking this, mixing in my own development.
Regards,
Takeshi Ideriha
Hello. Thank you for the comment.
At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
As a *PoC*, in the attached patch (which applies to current master),
the sizes of CTups are counted as the catcache size.
It also provides the pg_catcache_size system view just to give a rough
idea of how such a view looks. I'll consider it more, but do you have any opinion on this?
...
Great! I like this view.
One extreme idea would be adding all the members printed by CatCachePrintStats(),
which is only enabled with -DCATCACHE_STATS at this moment.
All of the members seem too much for customers who try to change the cache limit size,
but some of the members may be useful; for example, cc_hits would indicate that the current
cache limit size is too small.
The attached introduces the four features below. (But the features for
relcache and plancache are omitted.)
1. syscache stats collector (in 0002)
Records syscache status, consisting of the same columns as above
plus "ageclass" information. We could somehow trigger a stats
report with a signal, but we don't want to take/send/write the
statistics in a signal handler. Instead, it is turned on by setting
track_syscache_usage_interval to a positive number of
milliseconds.
2. pg_stat_syscache view. (in 0002)
This view shows catcache statistics. Statistics are taken only on
the backends where syscache tracking is active.
pid | application_name | relname | cache_name | size | ageclass | nentries
------+------------------+----------------+-----------------------------------+----------+-------------------------+---------------------------
9984 | psql | pg_statistic | pg_statistic_relid_att_inh_index | 12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}
Age class is the basis of the catcache truncation mechanism and shows
the distribution based on elapsed time since last access. As I
didn't come up with a more appropriate way, it is represented as two
arrays. Ageclass stores the maximum age for each class in
seconds. Nentries holds the entry counts corresponding to the same
elements in ageclass. In the above example,
age class : # of entries in the cache
up to 30s : 17660
up to 60s : 17310
up to 600s : 55870
up to 1200s : 0
up to 1800s : 0
longer : 0
The ageclass boundaries are the {0, 0.05, 0.1, 1, 2, 3} multiples of
cache_prune_min_age on the backend.
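The boundary computation implied by that description is simple; a sketch follows, where the multiplier set comes from this mail, the trailing 0 is the sentinel for the oldest class, and the function name is made up.

/*
 * Sketch: ageclass boundaries as fixed multiples of
 * cache_prune_min_age, with a trailing 0 meaning "older than all of
 * the above".  With cache_prune_min_age = 600 this yields
 * {30, 60, 600, 1200, 1800, 0}, matching the view output above.
 */
static const double ageclass_multipliers[] = {0.05, 0.1, 1.0, 2.0, 3.0};

static void
fill_ageclass_bounds(int *bounds, int prune_min_age)
{
	int			i;

	for (i = 0; i < (int) lengthof(ageclass_multipliers); i++)
		bounds[i] = (int) (ageclass_multipliers[i] * prune_min_age);
	bounds[i] = 0;				/* sentinel for the oldest class */
}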
3. non-transactional GUC setting (in 0003)
It allows a GUC variable set by the action
GUC_ACTION_NONXACT (the name requires consideration) to survive
beyond a rollback. It is required for remote GUC setting to work
sanely. Without this feature, a remotely-set value within a
transaction would disappear on rollback. The only local interface
for the NONXACT action is set_config(name, value, is_local=false,
is_nonxact = true). pg_set_backend_guc() below works on this
feature.
4. pg_set_backend_guc() function.
Of course, syscache statistics recording consumes a significant
amount of time, so it cannot usually be left turned on. On the other
hand, since this feature is controlled by a GUC, one would need to
grab the active client connection to turn the feature on or off
(but we cannot). Instead, I provided a means to change GUC
variables in another backend.
pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
on the backend "pid" to "value".
With the above tools, we can inspect catcache statistics of
seemingly bloated process.
A. Find a bloated process pid using ps or something.
B. Turn on syscache stats on the process.
=# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');
C. Examine the statistics.
=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc limit 3;
pid | relname | cache_name | size
------+--------------+----------------------------------+----------
9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
9984 | pg_cast | pg_cast_source_target_index | 4096
9984 | pg_operator | pg_operator_oprname_l_r_n_index | 4096
=# select * from pg_stat_syscache where cache_name = 'pg_statistic_relid_att_inh_index'::regclass;
-[ RECORD 1 ]---------------------------------
pid | 9984
relname | pg_statistic
cache_name | pg_statistic_relid_att_inh_index
size | 11026176
ntuples | 77950
searches | 77950
hits | 0
neg_hits | 0
ageclass | {30,60,600,1200,1800,0}
nentries | {17630,16950,43370,0,0,0}
last_update | 2018-10-17 15:58:19.738164+09
Another option is that users only specify the total memory target size
and postgres dynamically changes each CatCache memory target size according to a certain metric
(which still seems difficult and expensive to develop relative to the benefit).
What do you think about this?
Given that few caches bloat at once, its effect is not so different from the current design.
Yes, agreed.
As you commented here, the GUC variables syscache_memory_target and
syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
Right, just not to add knobs for unclear reasons. Since ...
Do syscache and relcache have a similar amount of memory usage?
They may be different, but that would not make much difference in the case of cache bloat.
If not, I'm thinking that introducing separate GUC variables would be fine.
So too for syscache_prune_min_age.
I implemented it that way so that it is easily replaceable in case, but I'm not sure
separating them makes a significant difference.
Maybe I was overthinking, mixing in my own development.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hello, thank you for updating the patch.
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi"
<ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
As a *PoC*, in the attached patch (which applies to current master), sizes of CTups are
counted as the catcache size. It also provides a pg_catcache_size system view just to give
a rough idea of how such a view looks. I'll consider more on that, but do you have any opinion on this?
...
Great! I like this view.
One extreme idea would be adding all the members printed by
CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at the moment.
All of the members seem too much for customers who try to change
the cache limit size, but some of the members may be useful;
for example, cc_hits would indicate that the current cache limit size is too small.
The attached introduces the four features below. (But the features on relcache and
plancache are omitted.)
I haven't looked into the code but I'm going to do it later.
Right now it seems to me that focusing on catalog cache invalidation and its stats
is a quick route to committing this feature.
1. syscache stats collector (in 0002)
Records syscache status, consisting of the same columns as above plus "ageclass"
information. We could somehow trigger a stats report with a signal, but we don't want
to take/send/write the statistics in a signal handler. Instead, it is turned on by setting
track_syscache_usage_interval to a positive number in milliseconds.
I agreed. Ageclass is important for tweaking prune_min_age.
Collecting stats at every stats change would be heavy.
2. pg_stat_syscache view. (in 0002)
This view shows catcache statistics. Statistics are taken only on the backends where
syscache tracking is active.
 pid  | application_name |   relname    |            cache_name            |   size   |        ageclass         |         nentries
------+------------------+--------------+----------------------------------+----------+-------------------------+---------------------------
 9984 | psql             | pg_statistic | pg_statistic_relid_att_inh_index | 12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}
Age class is the basis of the catcache truncation mechanism and shows the distribution
based on elapsed time since last access. As I didn't come up with a more appropriate way, it is
represented as two arrays. Ageclass stores the maximum age for each class in seconds.
Nentries holds the number of entries corresponding to the same element in ageclass. In the
above example:
age class : # of entries in the cache
up to 30s : 17660
up to 60s : 17310
up to 600s : 55870
up to 1200s : 0
up to 1800s : 0
longer : 0
The ageclass boundaries are {0, 0.05, 0.1, 1, 2, 3} times cache_prune_min_age on the
backend.
I just thought that the pair of ageclass and nentries could be represented as
JSON or as a multi-dimensional array, but they are virtually all the same and can be
converted to each other using some functions. So I'm not sure which representation
is the better one.
3. non-transactional GUC setting (in 0003)
It allows a GUC variable set with the action GUC_ACTION_NONXACT (the name
requires consideration) to survive beyond a rollback. It is required for remote GUC setting
to work sanely. Without this feature a remotely set value within a transaction would disappear
on rollback. The only local interface for the NONXACT action is
set_config(name, value, is_local=false, is_nonxact=true). pg_set_backend_guc()
below works on this feature.
TBH, I'm not familiar with this area and I may be missing something.
In order to change another backend's GUC value,
is ignoring transactional behavior always necessary? When the transaction that sets the GUC
fails and is rolled back, if the error message is supposed to be reported, I thought
just retrying the transaction would be enough.
4. pg_set_backend_guc() function.
Of course, syscache statistics recording consumes a significant amount of time, so it
cannot usually be left turned on. On the other hand, since this feature is controlled by a GUC,
we would need to grab the active client connection to turn the feature on or off (but we
cannot). Instead, I provided a means to change GUC variables in another backend.
pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
on the backend "pid" to "value".
With the above tools, we can inspect the catcache statistics of a seemingly bloated process.
A. Find a bloated process pid using ps or something.
B. Turn on syscache stats on the process.
=# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');
C. Examine the statistics.
=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc
limit 3;
pid | relname | cache_name | size
------+--------------+----------------------------------+----------
9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
9984 | pg_cast | pg_cast_source_target_index | 4096
 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
=# select * from pg_stat_syscache where cache_name = 'pg_statistic_relid_att_inh_index'::regclass;
-[ RECORD 1 ]---------------------------------
pid | 9984
relname | pg_statistic
cache_name | pg_statistic_relid_att_inh_index
size | 11026176
ntuples | 77950
searches | 77950
hits | 0
neg_hits | 0
ageclass | {30,60,600,1200,1800,0}
nentries | {17630,16950,43370,0,0,0}
last_update | 2018-10-17 15:58:19.738164+09
The output of this view seems good to me.
I can imagine this use case. Does the use case of setting the GUC locally never happen?
I mean, can the setting be changed locally?
Regards,
Takeshi Ideriha
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
I haven't looked into the code but I'm going to do it later.
Hi, I've taken a look at 0001 patch. Reviewing the rest of patch will be later.
if (!IsParallelWorker())
+ {
stmtStartTimestamp = GetCurrentTimestamp();
+
+ /* Set this timestamp as aproximated current time */
+ SetCatCacheClock(stmtStartTimestamp);
+ }
else
Just confirmation.
At first I thought that when a parallel worker is active, catcacheclock is not updated.
But when a parallel worker is active, catcacheclock is updated by the parent, so no problem occurs.
+ int tupsize = 0;
/* negative entries have no tuple associated */
if (ntp)
{
int i;
+ int tupsize;
+ ct->size = tupsize;
@@ -1906,17 +2051,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
ct->dead = false;
ct->negative = negative;
ct->hash_value = hashValue;
+ ct->naccess = 0;
+ ct->lastaccess = catcacheclock;
+ ct->size = tupsize;
tupsize is declared twice, inside and outside of the if scope, but it doesn't seem you need to do so.
And ct->size = tupsize is executed twice, in the if block and outside of the if-else block.
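To make the suggestion concrete, a minimal sketch of the deduplicated shape; the sizing expression is an assumption here, not taken from the patch:

	int			tupsize = 0;	/* single declaration */

	/* negative entries have no tuple associated */
	if (ntp)
	{
		tupsize = MAXALIGN(ntp->t_len);	/* assumed sizing; details elided */
		/* ... copy the tuple ... */
	}
	ct->naccess = 0;
	ct->lastaccess = catcacheclock;
	ct->size = tupsize;				/* assigned exactly once */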
+static inline TimestampTz
+GetCatCacheClock(void)
This function is not called by anyone in this version of the patch. In the previous version, it was called by plancache.
Will further patches focus only on catcache? In that case this one can be removed.
There are some typos.
+ int size; /* palloc'ed size off this tuple */
typo: off->of
+ /* Set this timestamp as aproximated current time */
typo: aproximated->approximated
+ * GUC variable to define the minimum size of hash to cosider entry eviction.
typo: cosider -> consider
+ /* initilize catcache reference clock if haven't done yet */
typo:initilize -> initialize
Regards,
Takeshi Ideriha
Thank you for reviewing.
At Thu, 15 Nov 2018 11:02:10 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F1F4165@G01JPEXMBKW04>
Hello, thank you for updating the patch.
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi"
<ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
As a *PoC*, in the attached patch (which applies to current master), sizes of CTups are
counted as the catcache size. It also provides a pg_catcache_size system view just to give
a rough idea of how such a view looks. I'll consider more on that, but do you have any opinion on this?
...
Great! I like this view.
One extreme idea would be adding all the members printed by
CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at the moment.
All of the members seem too much for customers who try to change
the cache limit size, but some of the members may be useful;
for example, cc_hits would indicate that the current cache limit size is too small.
The attached introduces the four features below. (But the features on relcache and
plancache are omitted.)
I haven't looked into the code but I'm going to do it later.
Right now it seems to me that focusing on catalog cache invalidation and its stats is a quick route
to committing this feature.
1. syscache stats collector (in 0002)
Records syscache status, consisting of the same columns as above plus "ageclass"
information. We could somehow trigger a stats report with a signal, but we don't want
to take/send/write the statistics in a signal handler. Instead, it is turned on by setting
track_syscache_usage_interval to a positive number in milliseconds.
I agreed. Ageclass is important for tweaking prune_min_age.
Collecting stats at every stats change would be heavy.
2. pg_stat_syscache view. (in 0002)
This view shows catcache statistics. Statistics are taken only on the backends where
syscache tracking is active.
 pid  | application_name |   relname    |            cache_name            |   size   |        ageclass         |         nentries
------+------------------+--------------+----------------------------------+----------+-------------------------+---------------------------
 9984 | psql             | pg_statistic | pg_statistic_relid_att_inh_index | 12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}
Age class is the basis of the catcache truncation mechanism and shows the distribution
based on elapsed time since last access. As I didn't come up with a more appropriate way, it is
represented as two arrays. Ageclass stores the maximum age for each class in seconds.
Nentries holds the number of entries corresponding to the same element in ageclass. In the
above example:
age class : # of entries in the cache
up to 30s : 17660
up to 60s : 17310
up to 600s : 55870
up to 1200s : 0
up to 1800s : 0
longer : 0
The ageclass boundaries are {0, 0.05, 0.1, 1, 2, 3} times cache_prune_min_age on the
backend.
I just thought that the pair of ageclass and nentries could be represented as
JSON or as a multi-dimensional array, but they are virtually all the same and can be
converted to each other using some functions. So I'm not sure which representation is the better one.
A multi-dimensional array in any style sounds reasonable. Maybe an
array is preferable in system views, as it is a more basic type than
JSON. In the attached, it looks like the following:
=# select * from pg_stat_syscache where ntuples > 100;
-[ RECORD 1 ]--------------------------------------------------
pid | 1817
relname | pg_class
cache_name | pg_class_oid_index
size | 2048
ntuples | 189
searches | 1620
hits | 1431
neg_hits | 0
ageclass | {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}}
last_update | 2018-11-27 19:22:00.74026+09
3. non-transactional GUC setting (in 0003)
It allows a GUC variable set with the action GUC_ACTION_NONXACT (the name
requires consideration) to survive beyond a rollback. It is required for remote GUC setting
to work sanely. Without this feature a remotely set value within a transaction would disappear
on rollback. The only local interface for the NONXACT action is
set_config(name, value, is_local=false, is_nonxact=true). pg_set_backend_guc()
below works on this feature.
TBH, I'm not familiar with this area and I may be missing something.
In order to change another backend's GUC value,
is ignoring transactional behavior always necessary? When the transaction that sets the GUC
fails and is rolled back, if the error message is supposed to be reported, I thought
just retrying the transaction would be enough.
The target backend can be running frequent transactions. The
invoking backend cannot know whether the remote change happened
during a transaction, nor whether that transaction, if any, was
committed or aborted; no error message is sent to the invoking backend.
We could wait for the end of a transaction, but that doesn't work
with long transactions.
Maybe we don't need this feature in the GUC system, but adding another,
similar mechanism doesn't seem reasonable either. This would be useful for
some other tracking features.
4. pg_set_backend_guc() function.
Of course, syscache statistics recording consumes a significant amount of time, so it
cannot usually be left turned on. On the other hand, since this feature is controlled by a GUC,
we would need to grab the active client connection to turn the feature on or off (but we
cannot). Instead, I provided a means to change GUC variables in another backend.
pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
on the backend "pid" to "value".
With the above tools, we can inspect the catcache statistics of a seemingly bloated process.
A. Find a bloated process pid using ps or something.
B. Turn on syscache stats on the process.
=# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');
C. Examine the statistics.
=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc limit 3;
 pid  |   relname    |            cache_name            |   size
------+--------------+----------------------------------+----------
 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
 9984 | pg_cast      | pg_cast_source_target_index      |     4096
 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
=# select * from pg_stat_syscache where cache_name = 'pg_statistic_relid_att_inh_index'::regclass;
-[ RECORD 1 ]---------------------------------
pid | 9984
relname | pg_statistic
cache_name | pg_statistic_relid_att_inh_index
size | 11026176
ntuples | 77950
searches | 77950
hits | 0
neg_hits | 0
ageclass | {30,60,600,1200,1800,0}
nentries | {17630,16950,43370,0,0,0}
last_update | 2018-10-17 15:58:19.738164+09
The output of this view seems good to me.
I can imagine this use case. Does the use case of setting the GUC locally never happen?
I mean, can the setting be changed locally?
Syscache grows through the life of a backend/session. No other
client can connect to it at the same time. So the variable
must be set at the start of a backend using ALTER USER/DATABASE,
or the client itself is obliged to deliberately turn on the
feature at a convenient time. I suppose that in most use cases
one wants to turn on this feature after seeing another session
eating more and more memory.
The attached is the rebased version that has the multidimensional
ageclass.
Thank you for the comments in the next mail, but sorry, I'll
address them later.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Tue, Nov 27, 2018 at 11:40 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The attached is the rebased version that has the multidimensional
ageclass.
Thank you,
Just for information, cfbot complains about this patch because:
pgstatfuncs.c: In function ‘pgstat_get_syscache_stats’:
pgstatfuncs.c:1973:8: error: ignoring return value of ‘fread’,
declared with attribute warn_unused_result [-Werror=unused-result]
fread(&cacheid, sizeof(int), 1, fpin);
^
pgstatfuncs.c:1974:8: error: ignoring return value of ‘fread’,
declared with attribute warn_unused_result [-Werror=unused-result]
fread(&last_update, sizeof(TimestampTz), 1, fpin);
^
I'm moving it to the next CF as "Waiting on author", since, as far as I
understood, you want to address more comments from the reviewer.
Hello,
Sorry for delay.
The detailed comments for the source code will be provided later.
I just thought that the pair of ageclass and nentries could be
represented as JSON or as a multi-dimensional array, but they are
virtually all the same and can be converted to each other using some functions. So I'm not sure which representation is the better one.
A multi-dimensional array in any style sounds reasonable. Maybe an array is preferable in
system views, as it is a more basic type than JSON. In the attached, it looks like the following:
=# select * from pg_stat_syscache where ntuples > 100;
-[ RECORD 1 ]--------------------------------------------------
pid | 1817
relname | pg_class
cache_name | pg_class_oid_index
size | 2048
ntuples | 189
searches | 1620
hits | 1431
neg_hits | 0
ageclass | {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}}
last_update | 2018-11-27 19:22:00.74026+09
Thanks, cool. That seems better to me.
3. non-transactional GUC setting (in 0003)
It allows a GUC variable set with the action
GUC_ACTION_NONXACT (the name requires consideration) to survive beyond a
rollback. It is required for remote GUC setting to work sanely.
Without this feature a remotely set value within a transaction would
disappear on rollback. The only local interface for the
NONXACT action is set_config(name, value, is_local=false, is_nonxact=true).
pg_set_backend_guc() below works on this feature.
TBH, I'm not familiar with this area and I may be missing something.
In order to change another backend's GUC value, is ignoring
transactional behavior always necessary? When the transaction that sets the GUC
fails and is rolled back, if the error message is supposed to
be reported, I thought just retrying the transaction would be enough.
The target backend can be running frequent transactions. The invoking backend cannot
know whether the remote change happened during a transaction, nor whether that
transaction, if any, was committed or aborted; no error message is sent to the invoking backend.
We could wait for the end of a transaction, but that doesn't work with long transactions.
Maybe we don't need this feature in the GUC system, but adding another, similar mechanism
doesn't seem reasonable either. This would be useful for some other tracking features.
Thank you for the clarification.
4. pg_set_backend_guc() function.
Of course, syscache statistics recording consumes a significant amount
of time, so it cannot usually be left turned on. On the other hand, since
this feature is controlled by a GUC, we would need to grab the active
client connection to turn the feature on or off (but we cannot). Instead, I provided a means to change GUC variables in another backend.
pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
on the backend "pid" to "value".
With the above tools, we can inspect the catcache statistics of a seemingly bloated
process.
A. Find a bloated process pid using ps or something.
B. Turn on syscache stats on the process.
=# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');
C. Examine the statistics.
=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc limit 3;
 pid  |   relname    |            cache_name            |   size
------+--------------+----------------------------------+----------
 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
 9984 | pg_cast      | pg_cast_source_target_index      |     4096
 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
=# select * from pg_stat_syscache where cache_name = 'pg_statistic_relid_att_inh_index'::regclass;
-[ RECORD 1 ]---------------------------------
pid | 9984
relname | pg_statistic
cache_name | pg_statistic_relid_att_inh_index
size | 11026176
ntuples | 77950
searches | 77950
hits | 0
neg_hits | 0
ageclass | {30,60,600,1200,1800,0}
nentries | {17630,16950,43370,0,0,0}
last_update | 2018-10-17 15:58:19.738164+09
The output of this view seems good to me.
I can imagine this use case. Does the use case of setting the GUC locally never happen?
I mean, can the setting be changed locally?
Syscache grows through the life of a backend/session. No other client can connect to
it at the same time. So the variable must be set at the start of a backend using ALTER
USER/DATABASE, or the client itself is obliged to deliberately turn on the feature at a
convenient time. I suppose that in most use cases one wants to turn on this feature
after seeing another session eating more and more memory.
The attached is the rebased version that has the multidimensional ageclass.
Thank you! That's convenient.
How about splitting the non-xact GUC and remote GUC setting feature into another commitfest entry?
I'm planning to review the 0001 and 0002 patches in more detail and hopefully turn them to 'ready for committer',
and review the remote GUC feature later.
Related to the feature division, why have you discarded pruning of the relcache and plancache?
Personally I want the relcache one as well as the catcache one, because regarding memory bloat there is some correlation between them.
Regards,
Takeshi Ideriha
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
The detailed comments for the source code will be provided later.
Hi,
I'm adding some comments to 0001 and 0002 one.
[0001 patch]
+ /*
+ * Calculate the duration from the time of the last access to the
+ * "current" time. Since catcacheclock is not advanced within a
+ * transaction, the entries that are accessed within the current
+ * transaction won't be pruned.
+ */
+ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+ /*
+ * Try to remove entries older than cache_prune_min_age seconds.
+ */
+ if (entry_age > cache_prune_min_age)
Can you change this comparison between entry_age and cache_prune_min_age
to "entry_age >= cache_prune_min_age"?
That is, I want entries that are accessed even within the current transaction
to be pruned in the case of cache_prune_min_age = 0.
I can think of some of my customers who want to always keep memory usage below a certain limit as strictly as possible.
This kind of strict user would set cache_prune_min_age to 0 and would not want to exceed the memory target even
within a transaction.
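For concreteness, a tiny sketch of the requested change, reusing the names from the quoted hunk:

	/*
	 * Inclusive comparison: with cache_prune_min_age = 0, entry_age == 0
	 * also qualifies, so entries touched within the current transaction
	 * become prunable too.
	 */
	if (entry_age >= cache_prune_min_age)
	{
		/* ... evict the entry ... */
	}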
As I put miscellaneous comments about the 0001 patch in a previous email, please take a look at it.
[0002 patch]
I haven't looked into every detail but here are some comments.
Maybe you would also need to add some sentences to this page:
https://www.postgresql.org/docs/current/monitoring-stats.html
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
A function name like 'pg_stat_XXX' would match the surrounding code.
When applying patch I found trailing whitespace warning:
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:157: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:256: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:301: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:483: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:539: trailing whitespace.
Regards,
Takeshi Ideriha
I'm really disappointed by the direction this thread is going in.
The latest patches add an enormous amount of mechanism, and user-visible
complexity, to do something that we learned was a bad idea decades ago.
Putting a limit on the size of the syscaches doesn't accomplish anything
except to add cycles if your cache working set is below the limit, or
make performance fall off a cliff if it's above the limit. I don't think
there's any reason to believe that making it more complicated will avoid
that problem.
What does seem promising is something similar to Horiguchi-san's
original patches all the way back at
/messages/by-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp
That is, identify usage patterns in which we tend to fill the caches with
provably no-longer-useful entries, and improve those particular cases.
Horiguchi-san identified one such case in that message: negative entries
in the STATRELATTINH cache, caused by the planner probing for stats that
aren't there, and then not cleared when the relevant table gets dropped
(since, by definition, they don't match any pg_statistic entry that gets
deleted). We saw another recent report of the same problem at
/messages/by-id/2114009259.1866365.1544469996900@mail.yahoo.com
so I'd been thinking about ways to fix that case in particular. I came
up with a fix that I think is simpler and a bit more efficient than
what Horiguchi-san proposed originally: rather than trying to reverse-
engineer what to do in low-level cache callbacks, let's have the catalog
manipulation code explicitly send out invalidation commands when the
relevant situations arise. In the attached, heap.c's RemoveStatistics
sends out an sinval message commanding deletion of negative STATRELATTINH
entries that match the OID of the table being deleted. We could use the
same infrastructure to clean out dead RELNAMENSP entries after a schema
deletion, as per Horiguchi-san's second original suggestion; although
I haven't done so here because I'm not really convinced that that's got
an attractive cost-benefit ratio. (In both my patch and Horiguchi-san's,
we have to traverse all entries in the affected cache, so sending out one
of these messages is potentially not real cheap.)
To do this we need to adjust the representation of sinval messages so
that we can have two different kinds of messages that include a cache ID.
Fortunately, because there's padding space available, that's not costly.
0001 below is a simple refactoring patch that converts the message type
ID into a plain enum field that's separate from the cache ID if any.
(I'm inclined to apply this whether or not people like 0002: it makes
the code clearer, more maintainable, and probably a shade faster thanks
to replacing an if-then-else chain with a switch.) Then 0002 adds the
feature of an sinval message type saying "delete negative entries in
cache X that have OID Y in key column Z", and teaches RemoveStatistics
to use that.
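For illustration, a hypothetical sketch of what such a message could look like; the field names and layout here are assumptions for exposition, not the committed format:

/*
 * Hypothetical sinval message: "delete negative entries in cache X
 * that have OID Y in key column Z".  Layout is illustrative only.
 */
typedef struct
{
	int8		id;				/* message type discriminator */
	int8		cacheId;		/* cache X to scan (fits in padding) */
	int8		keyColumn;		/* key column Z holding the OID */
	Oid			objectId;		/* OID Y of the dropped object */
} SharedInvalNegativeMsgSketch;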
Thoughts?
regards, tom lane
Attachments:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
I'm really disappointed by the direction this thread is going in.
The latest patches add an enormous amount of mechanism, and user-visible
complexity, to do something that we learned was a bad idea decades ago.
Putting a limit on the size of the syscaches doesn't accomplish anything
except to add cycles if your cache working set is below the limit, or make
performance fall off a cliff if it's above the limit. I don't think there's
any reason to believe that making it more complicated will avoid that
problem.
What does seem promising is something similar to Horiguchi-san's original patches all the way back at
/messages/by-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp
so I'd been thinking about ways to fix that case in particular.
You're suggesting to go back to the original issue (bloat caused by negative cache entries) and give a simpler solution to it first, aren't you? That may be the way to go.
But the syscache/relcache bloat still remains a problem, when there are many live tables and application connections. Would you agree to solve this in some way? I thought Horiguchi-san's latest patches would solve this and the negative entries. Can we consider that his patch and yours are orthogonal, i.e., we can pursue Horiguchi-san's patch after yours is committed?
(As you said, some parts of Horiguchi-san's patches may be made simpler. For example, the ability to change another session's GUC variable can be discussed in a separate thread.)
I think we need some limit to the size of the relcache, syscache, and plancache. Oracle and MySQL both have it, using LRU to evict less frequently used entries. You seem to be concerned about the LRU management based on your experience, but would it really cost so much as long as each postgres process can change the LRU list without coordination with other backends now? Could you share your experience?
FYI, Oracle provides one parameter, shared_pool_size, that determines the size of a memory area that contains SQL plans and various dictionary objects. Oracle decides how to divide the area among constituents. So it could be possible that one component (e.g. table/index metadata) is short of space, and another (e.g. SQL plans) has free space. Oracle provides a system view to see the free space and hit/miss of each component. If one component suffers from memory shortage, the user increases shared_pool_size. This is similar to what Horiguchi-san is proposing.
MySQL enables fine-tuning of each component. It provides the size parameters for six memory partitions of the dictionary object cache, and the usage statistics of those partitions through the Performance Schema.
tablespace definition cache
schema definition cache
table definition cache
stored program definition cache
character set definition cache
collation definition cache
I wonder whether we can group existing relcache/syscache entries like this.
[MySQL]
14.4 Dictionary Object Cache
https://dev.mysql.com/doc/refman/8.0/en/data-dictionary-object-cache.html
--------------------------------------------------
The dictionary object cache is a shared global cache that stores previously accessed data dictionary objects in memory to enable object reuse and minimize disk I/O. Similar to other cache mechanisms used by MySQL, the dictionary object cache uses an LRU-based eviction strategy to evict least recently used objects from memory.
The dictionary object cache comprises cache partitions that store different object types. Some cache partition size limits are configurable, whereas others are hardcoded.
--------------------------------------------------
8.12.3.1 How MySQL Uses Memory
https://dev.mysql.com/doc/refman/8.0/en/memory-use.html
--------------------------------------------------
table_open_cache
MySQL requires memory and descriptors for the table cache.
table_definition_cache
For InnoDB, table_definition_cache acts as a soft limit for the number of open table instances in the InnoDB data dictionary cache. If the number of open table instances exceeds the table_definition_cache setting, the LRU mechanism begins to mark table instances for eviction and eventually removes them from the data dictionary cache. The limit helps address situations in which significant amounts of memory would be used to cache rarely used table instances until the next server restart.
--------------------------------------------------
Regards
Takayuki Tsunakawa
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
But the syscache/relcache bloat still remains a problem, when there are many live tables and application connections. Would you agree to solve this in some way? I thought Horiguchi-san's latest patches would solve this and the negative entries. Can we consider that his patch and yours are orthogonal, i.e., we can pursue Horiguchi-san's patch after yours is committed?
Certainly, what I've done here doesn't preclude adding some wider solution to
the issue of extremely large catcaches. I think it takes the pressure off
for one rather narrow problem case, and the mechanism could be used to fix
other ones. But if you've got an application that just plain accesses a
huge number of objects, this isn't going to make your life better.
(As you said, some parts of Horiguchi-san's patches may be made simpler. For example, the ability to change another session's GUC variable can be discussed in a separate thread.)
Yeah, that idea seems just bad from here ...
I think we need some limit to the size of the relcache, syscache, and plancache. Oracle and MySQL both have it, using LRU to evict less frequently used entries. You seem to be concerned about the LRU management based on your experience, but would it really cost so much as long as each postgres process can change the LRU list without coordination with other backends now? Could you share your experience?
Well, we *had* an LRU mechanism for the catcaches way back when. We got
rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU
info was expensive and (b) performance fell off a cliff in scenarios where
the cache size limit was exceeded. You could probably find some more info
about that by scanning the mail list archives from around the time of that
commit, but I'm too lazy to do so right now.
That was a dozen years ago, and it's possible that machine performance
has moved so much since then that the problems are gone or mitigated.
In particular I'm sure that any limit we would want to impose today will
be far more than the 5000-entries-across-all-caches limit that was in use
back then. But I'm not convinced that a workload that would create 100K
cache entries in the first place wouldn't have severe problems if you
tried to constrain it to use only 80K entries. I fear it's just wishful
thinking to imagine that the behavior of a larger cache won't be just
like a smaller one. Also, IIRC some of the problem with the LRU code
was that it resulted in lots of touches of unrelated data, leading to
CPU cache miss problems. It's hard to see how that doesn't get even
worse with a bigger cache.
As far as the relcache goes, we've never had a limit on that, but there
are enough routine causes of relcache flushes --- autovacuum for instance
--- that I'm not really convinced relcache bloat can be a big problem in
production.
The plancache has never had a limit either, which is a design choice that
was strongly influenced by our experience with catcaches. Again, I'm
concerned about the costs of adding a management layer, and the likelihood
that cache flushes will simply remove entries we'll soon have to rebuild.
FYI, Oracle provides one parameter, shared_pool_size, that determine the
size of a memory area that contains SQL plans and various dictionary
objects. Oracle decides how to divide the area among constituents. So
it could be possible that one component (e.g. table/index metadata) is
short of space, and another (e.g. SQL plans) has free space. Oracle
provides a system view to see the free space and hit/miss of each
component. If one component suffers from memory shortage, the user
increases shared_pool_size. This is similar to what Horiguchi-san is
proposing.
Oracle seldom impresses me as having designs we ought to follow.
They have a well-earned reputation for requiring a lot of expertise to
operate, which is not the direction this project should be going in.
In particular, I don't want to "solve" cache size issues by exposing
a bunch of knobs that most users won't know how to twiddle.
regards, tom lane
Hi,
On 2019-01-15 13:32:36 -0500, Tom Lane wrote:
Well, we *had* an LRU mechanism for the catcaches way back when. We got
rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU
info was expensive and (b) performance fell off a cliff in scenarios where
the cache size limit was exceeded. You could probably find some more info
about that by scanning the mail list archives from around the time of that
commit, but I'm too lazy to do so right now.
That was a dozen years ago, and it's possible that machine performance
has moved so much since then that the problems are gone or mitigated.
In particular I'm sure that any limit we would want to impose today will
be far more than the 5000-entries-across-all-caches limit that was in use
back then. But I'm not convinced that a workload that would create 100K
cache entries in the first place wouldn't have severe problems if you
tried to constrain it to use only 80K entries.
I think that'd be true if the accesses were truly randomly
distributed - but that's not the case in the cases where I've seen huge
caches. It's usually workloads that have tons of functions, partitions,
... and a lot of them are not that frequently accessed, but because we
have no cache purging mechanism they stay around for a long time. This is
often exacerbated by using a pooler to keep connections around for
longer (which you have to, to cope with other limits of PG).
As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production.
It definitely is.
The plancache has never had a limit either, which is a design choice that
was strongly influenced by our experience with catcaches.
This sounds a lot like having learned lessons from one bad implementation
and applying them far outside of that situation.
Greetings,
Andres Freund
On Tue, Jan 15, 2019 at 01:32:36PM -0500, Tom Lane wrote:
...
FYI, Oracle provides one parameter, shared_pool_size, that determine the
size of a memory area that contains SQL plans and various dictionary
objects. Oracle decides how to divide the area among constituents. So
it could be possible that one component (e.g. table/index metadata) is
short of space, and another (e.g. SQL plans) has free space. Oracle
provides a system view to see the free space and hit/miss of each
component. If one component suffers from memory shortage, the user
increases shared_pool_size. This is similar to what Horiguchi-san is
proposing.Oracle seldom impresses me as having designs we ought to follow.
They have a well-earned reputation for requiring a lot of expertise to
operate, which is not the direction this project should be going in.
In particular, I don't want to "solve" cache size issues by exposing
a bunch of knobs that most users won't know how to twiddle.
regards, tom lane
+1
Regards,
Ken
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Certainly, what I've done here doesn't preclude adding some wider solution
to the issue of extremely large catcaches.
I'm relieved to hear that.
I think it takes the pressure off
for one rather narrow problem case, and the mechanism could be used to fix
other ones. But if you've got an application that just plain accesses a
huge number of objects, this isn't going to make your life better.
I understand you're trying to solve the problem caused by negative cache entries as soon as possible, because the user is really suffering from it. I feel sympathy with that attitude, because you seem to be always addressing issues that others are reluctant to take. That's one of the reasons I respect you.
Well, we *had* an LRU mechanism for the catcaches way back when. We got
rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU
info was expensive and (b) performance fell off a cliff in scenarios where
the cache size limit was exceeded. You could probably find some more info
about that by scanning the mail list archives from around the time of that
commit, but I'm too lazy to do so right now.
Oh, in 2006... I'll examine the patch and the discussion to see how the LRU management was done.
That was a dozen years ago, and it's possible that machine performance
has moved so much since then that the problems are gone or mitigated.
I really, really hope so. Even if we see some visible impact from the LRU management, I think that's a debt PostgreSQL has had to pay but hasn't yet. Even the single-process MySQL, which doesn't suffer from cache bloat across many server processes, has the ability to limit the cache. And PostgreSQL has many parameters for various memory components such as shared_buffers, wal_buffers, work_mem, etc., so it would be reasonable to also have a limit for the catalog caches. That said, we can avoid the penalty and retain the current performance by disabling the limit (some_size_param = 0).
I think we'll evaluate the impact of LRU management by adding prev and next members to the catcache and relcache structures, and putting the entry at the front (or back) of the LRU chain every time the entry is accessed. I think pgbench's select-only mode is enough for evaluation. I'd like to hear if any other workload is more appropriate to see the CPU cache effect.
In particular I'm sure that any limit we would want to impose today will
be far more than the 5000-entries-across-all-caches limit that was in use
back then. But I'm not convinced that a workload that would create 100K
cache entries in the first place wouldn't have severe problems if you
tried to constrain it to use only 80K entries. I fear it's just wishful
thinking to imagine that the behavior of a larger cache won't be just
like a smaller one. Also, IIRC some of the problem with the LRU code
was that it resulted in lots of touches of unrelated data, leading to
CPU cache miss problems. It's hard to see how that doesn't get even
worse with a bigger cache.
As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production.
As Andres and Robert mentioned, we want to free less frequently used cache entries. Otherwise, we are now suffering from bloat of up to terabytes of memory. This is a real, not hypothetical, issue...
The plancache has never had a limit either, which is a design choice that
was strongly influenced by our experience with catcaches. Again, I'm
concerned about the costs of adding a management layer, and the likelihood
that cache flushes will simply remove entries we'll soon have to rebuild.
Fortunately, we're not bothered with the plan cache. But I remember you said you were annoyed by PL/pgSQL's plan cache use at Salesforce. Were you able to overcome it somehow?
Oracle seldom impresses me as having designs we ought to follow.
They have a well-earned reputation for requiring a lot of expertise to
operate, which is not the direction this project should be going in.
In particular, I don't want to "solve" cache size issues by exposing
a bunch of knobs that most users won't know how to twiddle.
Oracle certainly seems to be difficult to use. But they seem to be studying other DBMSs to make it simpler to use. I'm sure they also have a lot we should learn from, and the cache limit is one of them (although MySQL's per-cache tuning may be better).
And having limits for various components would be the first step toward the autonomous database; tunable -> auto tuning -> autonomous
Regards
Takayuki Tsunakawa
On Sun, Jan 13, 2019 at 11:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Putting a limit on the size of the syscaches doesn't accomplish anything
except to add cycles if your cache working set is below the limit, or
make performance fall off a cliff if it's above the limit.
If you're running on a Turing machine, sure. But real machines have
finite memory, or at least all the ones I use do. Horiguchi-san is
right that this is a real, not theoretical problem. It is one of the
most frequent operational concerns that EnterpriseDB customers have.
I'm not against solving specific cases with more targeted fixes, but I
really believe we need something more. Andres mentioned one problem
case: connection poolers that eventually end up with a cache entry for
every object in the system. Another case is that of people who keep
idle connections open for long periods of time; those connections can
gobble up large amounts of memory even though they're not going to use
any of their cache entries any time soon.
The flaw in your thinking, as it seems to me, is that in your concern
for "the likelihood that cache flushes will simply remove entries
we'll soon have to rebuild," you're apparently unwilling to consider
the possibility of workloads where cache flushes will remove entries
we *won't* soon have to rebuild. Every time that issue gets raised,
you seem to blow it off as if it were not a thing that really happens.
I can't make sense of that position. Is it really so hard to imagine
a connection pooler that switches the same connection back and forth
between two applications with different working sets? Or a system
that keeps persistent connections open even when they are idle? Do
you really believe that a connection that has not accessed a cache
entry in 10 minutes still derives more benefit from that cache entry
than it would from freeing up some memory?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
The flaw in your thinking, as it seems to me, is that in your concern
for "the likelihood that cache flushes will simply remove entries
we'll soon have to rebuild," you're apparently unwilling to consider
the possibility of workloads where cache flushes will remove entries
we *won't* soon have to rebuild. Every time that issue gets raised,
you seem to blow it off as if it were not a thing that really happens.
I can't make sense of that position. Is it really so hard to imagine
a connection pooler that switches the same connection back and forth
between two applications with different working sets? Or a system
that keeps persistent connections open even when they are idle? Do
you really believe that a connection that has not accessed a cache
entry in 10 minutes still derives more benefit from that cache entry
than it would from freeing up some memory?
Well, I think everyone agrees there are workloads that cause undesired
cache bloat. What we have not found is a solution that doesn't cause
code complexity or undesired overhead, or one that >1% of users will
know how to use.
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
think the trick is figuring out how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
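As a rough sketch of that idea under stated assumptions (all names here are illustrative, not PostgreSQL's):

/*
 * Illustrative query-count-based sweep: entries are flagged on use;
 * every SWEEP_INTERVAL queries, unflagged entries are evicted and the
 * flags are reset.
 */
#include <stdbool.h>

#define SWEEP_INTERVAL 10

typedef struct EntrySketch
{
	bool		used;			/* touched since the last sweep? */
	/* ... key and cached tuple would live here ... */
} EntrySketch;

static void
evict(EntrySketch *e)
{
	/* free the cached data; placeholder in this sketch */
	e->used = false;
}

static int	queries_since_sweep = 0;

static void
sweep_if_due(EntrySketch *entries, int nentries)
{
	int			i;

	if (++queries_since_sweep < SWEEP_INTERVAL)
		return;
	queries_since_sweep = 0;

	for (i = 0; i < nentries; i++)
	{
		if (!entries[i].used)
			evict(&entries[i]);			/* not used in the last interval */
		else
			entries[i].used = false;	/* reset for the next interval */
	}
}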
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On 18/01/2019 08:48, Bruce Momjian wrote:
On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
The flaw in your thinking, as it seems to me, is that in your concern
for "the likelihood that cache flushes will simply remove entries
we'll soon have to rebuild," you're apparently unwilling to consider
the possibility of workloads where cache flushes will remove entries
we *won't* soon have to rebuild. Every time that issue gets raised,
you seem to blow it off as if it were not a thing that really happens.
I can't make sense of that position. Is it really so hard to imagine
a connection pooler that switches the same connection back and forth
between two applications with different working sets? Or a system
that keeps persistent connections open even when they are idle? Do
you really believe that a connection that has not accessed a cache
entry in 10 minutes still derives more benefit from that cache entry
than it would from freeing up some memory?
Well, I think everyone agrees there are workloads that cause undesired
cache bloat. What we have not found is a solution that doesn't cause
code complexity or undesired overhead, or one that >1% of users will
know how to use.
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
think the trick is figuring out how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
If you take that approach, then this number should be configurable.
What if I had 12 common queries I used in rotation?
The ARM3 processor cache logic was to simply eject an entry at random,
as obviously Acorn felt that the silicon required for a more
sophisticated algorithm would reduce the cache size too much!
I upgraded my Acorn Archimedes that had an 8MHz bus, from an 8MHz ARM2
to a 25MHz ARM3. That is a clock rate improvement of about 3 times.
However, BASIC programs ran about 7 times faster, which I put down to the
ARM3 having a cache.
Obviously for Postgres this is not directly relevant, but I think it
suggests that it may be worth considering replacing cache items at
random. As there are no pathological corner cases, and the logic is
very simple.
Cheers,
Gavin
Hello.
At Fri, 18 Jan 2019 11:46:03 +1300, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote in <4e62e6b7-0ffb-54ae-3757-5583fcca38c0@archidevsys.co.nz>
On 18/01/2019 08:48, Bruce Momjian wrote:
On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
The flaw in your thinking, as it seems to me, is that in your concern
for "the likelihood that cache flushes will simply remove entries
we'll soon have to rebuild," you're apparently unwilling to consider
the possibility of workloads where cache flushes will remove entries
we *won't* soon have to rebuild. Every time that issue gets raised,
you seem to blow it off as if it were not a thing that really happens.
I can't make sense of that position. Is it really so hard to imagine
a connection pooler that switches the same connection back and forth
between two applications with different working sets? Or a system
that keeps persistent connections open even when they are idle? Do
you really believe that a connection that has not accessed a cache
entry in 10 minutes still derives more benefit from that cache entry
than it would from freeing up some memory?
Well, I think everyone agrees there are workloads that cause undesired
cache bloat. What we have not found is a solution that doesn't cause
code complexity or undesired overhead, or one that >1% of users will
know how to use.
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
So, it doesn't use LRU but a kind of clock-sweep method. When it
finds that the size is about to exceed the threshold by
resizing (doubling) because the current hash is filled up, it tries to
trim away the entries that have been left idle for a duration corresponding
to their usage count. This is not a hard limit but seems to be a good
compromise.
think the trick is figuring out how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
As above, it tries pruning at every resize. So this adds
complexity to the frequent paths only by setting the last-accessed
time and incrementing an access counter. It scans the whole hash at
resize time, but that doesn't add much compared to the resizing itself.
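A rough sketch of that prune-at-resize idea; the names and the usage-count scaling here are illustrative assumptions, not the patch's exact logic:

/*
 * Illustrative prune-at-resize: before doubling the hash, try to evict
 * entries idle longer than a grace period that grows with usage count;
 * grow only if too little was reclaimed.
 */
typedef struct PruneEntrySketch
{
	long		last_access;	/* catcache clock at last access */
	int			naccess;		/* clamped access counter */
} PruneEntrySketch;

static int
prune_before_resize(PruneEntrySketch *entries, int nentries,
					long clock, long prune_min_age)
{
	int			removed = 0;
	int			i;

	for (i = 0; i < nentries; i++)
	{
		/* more-used entries get a longer grace period */
		long		grace = prune_min_age << entries[i].naccess;

		if (clock - entries[i].last_access > grace)
			removed++;			/* evict entries[i] here */
	}
	return removed;				/* caller may skip doubling if enough freed */
}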
If you take that approach, then this number should be configurable.
What if I had 12 common queries I used in rotation?
This basically has two knobs: the minimum hash size at which
pruning starts, and the idle time before reaping unused entries, per
catcache.
The ARM3 processor cache logic was to simply eject an entry at random,
as obviously Acorn felt that the silicon required for a more
sophisticated algorithm would reduce the cache size too much!
I upgraded my Acorn Archimedes that had an 8MHz bus, from an 8MHz ARM2
to a 25MHz ARM3. That is a clock rate improvement of about 3 times.
However, BASIC programs ran about 7 times faster, which I put down to
the ARM3 having a cache.
Obviously for Postgres this is not directly relevant, but I think it
suggests that it may be worth considering replacing cache items at
random, as there are no pathological corner cases and the logic is
very simple.
Memory was more expensive than nowadays by... about 10^3 times? An
obvious advantage of random reaping is that it requires less silicon. I
think we don't need to be so stingy, but perhaps clock-sweep is
the maximum we can pay.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 18 Jan 2019 16:39:29 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190118.163929.229869562.horiguchi.kyotaro@lab.ntt.co.jp>
Hello.
At Fri, 18 Jan 2019 11:46:03 +1300, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote in <4e62e6b7-0ffb-54ae-3757-5583fcca38c0@archidevsys.co.nz>
On 18/01/2019 08:48, Bruce Momjian wrote:
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
So, it doesn't use LRU but a kind of clock-sweep method. When it
finds that the size is about to exceed the threshold by
resizing (doubling) because the current hash is filled up, it tries to
trim away the entries that have been left idle for a duration corresponding
to their usage count. This is not a hard limit but seems to be a good
compromise.
think the trick is figuring out how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
As above, it tries pruning at every resize. So this adds
complexity to the frequent paths only by setting the last-accessed
time and incrementing an access counter. It scans the whole hash at
resize time, but that doesn't add much compared to the resizing itself.
If you take that approach, then this number should be configurable.
What if I had 12 common queries I used in rotation?
This basically has two knobs: the minimum hash size at which
pruning starts, and the idle time before reaping unused entries, per
catcache.
This is the rebased version.
0001: catcache pruning
syscache_memory_target controls the per-cache minimum size at which
pruning starts.
syscache_prune_min_time controls the minimum idle duration until a
catcache entry is removed.
0002: catcache statistics view
track_syscache_usage_interval is the interval at which catcache
statistics are collected.
pg_stat_syscache is the view that shows the statistics.
0003: Remote GUC setting
It is independent from the above two, and heavily arguable.
pg_set_backend_config(pid, name, value) changes the GUC <name> on
the backend with <pid> to <value>.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote:
Well, I think everyone agrees there are workloads that cause undesired
cache bloat. What we have not found is a solution that doesn't cause
code complexity or undesired overhead, or one that >1% of users will
know how to use.
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
think the trick is figuring out how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
I still think wall-clock time is a perfectly reasonable heuristic.
Say every 5 or 10 minutes you walk through the cache. Anything that
hasn't been touched since the last scan you throw away. If you do
this, you MIGHT flush an entry that you're just about to need again,
but (1) it's not very likely, because if it hasn't been touched in
many minutes, the chances that it's about to be needed again are low,
and (2) even if it does happen, it probably won't cost all that much,
because *occasionally* reloading a cache entry unnecessarily isn't
that costly; the big problem is when you do it over and over again,
which can easily happen with a fixed size limit on the cache, and (3)
if somebody does have a workload where they touch the same object
every 11 minutes, we can give them a GUC to control the timeout
between cache sweeps and it's really not that hard to understand how
to set it. And most people won't need to.
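A toy sketch of that sweep (names invented here; the real thing would walk the catcache hash):

#include <stddef.h>

typedef struct Entry
{
    _Bool touched;   /* set to true on every cache hit */
    /* ... */
} Entry;

/* Housekeeping pass run every N minutes by a timer. */
static void
housekeeping_sweep(Entry *entries[], int n)
{
    for (int i = 0; i < n; i++)
    {
        if (entries[i] == NULL)
            continue;
        if (!entries[i]->touched)
            entries[i] = NULL;           /* untouched for a whole interval: evict */
        else
            entries[i]->touched = false; /* rearm for the next interval */
    }
}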
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote:
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
think the trick is figuring how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
I still think wall-clock time is a perfectly reasonable heuristic.
The easy implementations of that involve putting gettimeofday() calls
into hot code paths, which would be a Bad Thing. But maybe we could
do this only at transaction or statement start, and piggyback on the
gettimeofday() calls that already happen at those times.
regards, tom lane
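A minimal sketch of that piggybacking idea (the clock variable and functions here are assumptions for illustration, not existing APIs):

typedef long long TimestampTz;   /* stand-in for PostgreSQL's int64 timestamp */

typedef struct CacheEntryStamp
{
    TimestampTz lastaccess;
} CacheEntryStamp;

static TimestampTz catcacheclock;  /* hypothetical backend-local clock */

/* Called from transaction/statement start, where the time is taken anyway. */
static void
update_catcache_clock(TimestampTz xact_start)
{
    catcacheclock = xact_start;
}

/* Hot lookup path: a plain assignment, no gettimeofday(). */
static void
note_access(CacheEntryStamp *e)
{
    e->lastaccess = catcacheclock;
}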
On 2019-01-18 15:57:17 -0500, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote:
Unfortunately, because we have not found something we are happy with, we
have done nothing. I agree LRU can be expensive. What if we do some
kind of clock sweep and expiration like we do for shared buffers? I
think the trick is figuring how frequently to do the sweep. What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
I still think wall-clock time is a perfectly reasonable heuristic.
The easy implementations of that involve putting gettimeofday() calls
into hot code paths, which would be a Bad Thing. But maybe we could
do this only at transaction or statement start, and piggyback on the
gettimeofday() calls that already happen at those times.
My proposal for this was to attach a 'generation' to cache entries. Upon
access, cache entries are marked as belonging to the current
generation. Whenever existing memory isn't sufficient for further cache
entries, and also on a less frequent schedule triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured. Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generations are
removed from the caches using a sequential scan through the cache.
This outline achieves:
- no additional time measurements in hot code paths
- no need for a sequential scan of the entire cache when no generations
are too old
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
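A toy sketch of this outline (every name here is invented for illustration; a real ring of generations would need care with wraparound):

#include <stdbool.h>
#include <time.h>

#define NGENERATIONS 8            /* arbitrary ring size for the sketch */

static unsigned current_gen;
static time_t   gen_created[NGENERATIONS];

/* hot path: stamp the generation, no clock read */
static void
touch_entry(unsigned *entry_gen)
{
    *entry_gen = current_gen;
}

/* on memory pressure or a timer: open a new generation */
static void
bump_generation(void)
{
    current_gen++;
    gen_created[current_gen % NGENERATIONS] = time(NULL);
}

/* cheap precheck: scan the cache only if a generation is actually stale */
static bool
generation_expired(unsigned entry_gen, time_t max_age)
{
    return time(NULL) - gen_created[entry_gen % NGENERATIONS] > max_age;
}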
Greetings,
Andres Freund
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
My proposal for this was to attach a 'generation' to cache entries. Upon
access, cache entries are marked as belonging to the current
generation. Whenever existing memory isn't sufficient for further cache
entries, and also on a less frequent schedule triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured. Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generations are
removed from the caches using a sequential scan through the cache.
This outline achieves:
- no additional time measurements in hot code paths
- no need for a sequential scan of the entire cache when no generations
are too old
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
Seems generally reasonable. The "whenever existing memory isn't
sufficient for further cache entries" part I'm not sure about.
Couldn't that trigger very frequently and prevent necessary cache size
growth?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
My proposal for this was to attach a 'generation' to cache entries. Upon
access, cache entries are marked as belonging to the current
generation. Whenever existing memory isn't sufficient for further cache
entries, and also on a less frequent schedule triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured. Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generations are
removed from the caches using a sequential scan through the cache.
This outline achieves:
- no additional time measurements in hot code paths
- no need for a sequential scan of the entire cache when no generations
are too old
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
Seems generally reasonable. The "whenever existing memory isn't
sufficient for further cache entries" part I'm not sure about.
Couldn't that trigger very frequently and prevent necessary cache size
growth?
I'm thinking it'd just trigger a new generation, with its associated
"creation" time (which is cheap to acquire in comparison to creating a
number of cache entries). Depending on settings or just code policy we
can decide up to which generation to prune the cache, using that
creation time. I'd imagine that we'd have some default cache-pruning
time in the minutes, and for workloads where relevant one can make
sizing configurations more aggressive - or something like that.
Greetings,
Andres Freund
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
0003: Remote GUC setting
It is independent from the above two, and heavily arguable.
pg_set_backend_config(pid, name, value) changes the GUC <name> on the
backend with <pid> to <value>.
Not having looked at the code yet, why did you think this is necessary? Can't we always collect the cache stats? Is it heavy due to some locking in the shared memory, or sending the stats to the stats collector?
Regards
Takayuki Tsunakawa
Hello.
At Fri, 18 Jan 2019 17:09:41 -0800, "andres@anarazel.de" <andres@anarazel.de> wrote in <20190119010941.6ruftewah7t3k3yk@alap3.anarazel.de>
Hi,
On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
My proposal for this was to attach a 'generation' to cache entries. Upon
access, cache entries are marked as belonging to the current
generation. Whenever existing memory isn't sufficient for further cache
entries, and also on a less frequent schedule triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured. Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generations are
removed from the caches using a sequential scan through the cache.
This outline achieves:
- no additional time measurements in hot code paths
The time is taken at every transaction start and stored as a
TimestampTz in this patch. No additional time measurement is added,
but cache pruning won't happen if a transaction lives for a long
time. A time-driven generation value, maybe with a fixed interval of
10s to 1min, is a possible option.
- no need for a sequential scan of the entire cache when no generations
are too old
This patch doesn't precheck against the oldest generation, but that
can be easily calculated. (It is based not on the creation time but
on the last-access time.) (The attached applies over the
v7-0001-Remove-entries-..patch)
Using generation time, entries are purged even if they were recently
accessed. I think last-accessed time is more suitable for the
purpose. On the other hand, using last-accessed time, the oldest
generation can become stale through later accesses.
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
Overhead when disabled is already zero, since scanning is
inhibited when cache_prune_min_age is a negative value.
Seems generally reasonable. The "whenever existing memory isn't
sufficient for further cache entries" part I'm not sure about.
Couldn't that trigger very frequently and prevent necessary cache size
growth?
I'm thinking it'd just trigger a new generation, with its associated
"creation" time (which is cheap to acquire in comparison to creating a
number of cache entries) . Depending on settings or just code policy we
can decide up to which generation to prune the cache, using that
creation time. I'd imagine that we'd have some default cache-pruning
time in the minutes, and for workloads where relevant one can make
sizing configurations more aggressive - or something like that.
The current patch takes the last-accessed time by a
non-gettimeofday() method. The generation is fixed at up to 3, and
infrequently-accessed entries are removed sooner. The generation
interval is determined by cache_prune_min_age.
Although this doesn't put a hard cap on memory usage, it is
indirectly and softly limited by cache_prune_min_age and
cache_memory_target, which determine how large a cache can grow
before pruning happens. They are per-cache settings.
If we prefer to set a budget on all the syscaches (or even
including other caches), it would be more complex.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Thank you for pointing out the stupidity. (Tom did earlier, though.)
At Mon, 21 Jan 2019 07:12:41 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB6C78A@G01JPEXMBYT05>
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
0003: Remote GUC setting
It is independent from the above two, and heavily arguable.
pg_set_backend_config(pid, name, value) changes the GUC <name> on the
backend with <pid> to <value>.
Not having looked at the code yet, why did you think this is necessary? Can't we always collect the cache stats? Is it heavy due to some locking in the shared memory, or sending the stats to the stats collector?
Yeah, I had fun making it, but I can't say it is very good. I must
admit that it is a kind of overkill.
Anyway, it needs to scan the whole hash to collect the numbers, and I
don't see how to eliminate that complexity without a penalty on
regular code paths for now. For that reason I don't want to do it
all the time.
An option is an additional PGPROC member and interface functions.
struct PGPROC
{
...
int syscache_usage_track_interval; /* track interval, 0 to disable */
=# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]);
=# select syscache_usage_track_remove(2134);
Or, just provide a one-shot triggering function.
=# select syscache_take_usage_track(<pid>);
This can use either a similar PGPROC variable or SendProcSignal(),
but the former doesn't fire during idle time unless a timer is used.
Any thoughts?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote:
Hi,
On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
My proposal for this was to attach a 'generation' to cache entries. Upon
access cache entries are marked to be of the current
generation. Whenever existing memory isn't sufficient for further cache
entries and, on a less frequent schedule, triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured. Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generation are
removed from the caches using a sequential scan through the cache.
This outline achieves:
- no additional time measurements in hot code paths
- no need for a sequential scan of the entire cache when no generations
are too old
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
Seems generally reasonable. The "whenever existing memory isn't
sufficient for further cache entries" part I'm not sure about.
Couldn't that trigger very frequently and prevent necessary cache size
growth?
I'm thinking it'd just trigger a new generation, with its associated
"creation" time (which is cheap to acquire in comparison to creating a
number of cache entries) . Depending on settings or just code policy we
can decide up to which generation to prune the cache, using that
creation time. I'd imagine that we'd have some default cache-pruning
time in the minutes, and for workloads where relevant one can make
sizing configurations more aggressive - or something like that.
OK, so it seems everyone likes the idea of a timer. The open questions
are whether we want multiple epochs, and whether we want some kind of
size trigger.
With only one time epoch, if the timer is 10 minutes, you could expire an
entry after 10-19 minutes, while with a new epoch every minute and
10-minute expire, you can do 10-11 minute precision. I am not sure the
complexity is worth it.
For a size trigger, should removal be affected by how many expired cache
entries there are? If there were 10k expired entries or 50, wouldn't
you want them removed if they have not been accessed in X minutes?
In the worst case, if 10k entries were accessed in a query and never
accessed again, what would the ideal cleanup behavior be? Would it
matter if it was expired in 10 or 19 minutes? Would it matter if there
were only 50 entries?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Although this doesn't put a hard cap on memory usage, it is indirectly and
softly limited by the cache_prune_min_age and cache_memory_target, which
determine how large a cache can grow until pruning happens. They are
per-cache settings.
If we prefer to set a budget on all the syscaches (or even including other
caches), it would be more complex.
This is a pure question. How can we answer these questions from users?
* What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target = 1M?
The user tends to estimate memory to avoid OOM.
Regards
Takayuki Tsunakawa
At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp>
An option is an additional PGPROC member and interface functions.
struct PGPROC
{
...
int syscache_usage_track_interval; /* track interval, 0 to disable */
=# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]);
=# select syscache_usage_track_remove(2134);
Or, just provide a one-shot triggering function.
=# select syscache_take_usage_track(<pid>);
This can use either a similar PGPROC variable or SendProcSignal(),
but the former doesn't fire during idle time unless a timer is used.
The attached is the revised version of this patchset, where the third
patch is the remote setting feature. It uses static shared memory.
=# select pg_backend_catcache_stats(<pid>, <millis>);
Activates or changes catcache stats feature on the backend with
PID. (The name should be changed to .._syscache_stats, though.)
It is far smaller than the remote-GUC feature. (It contains a
part that should be in the previous patch. I will fix it later.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Wed, Jan 23, 2019 at 05:35:02PM +0900, Kyotaro HORIGUCHI wrote:
At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp>
An option is an additional PGPROC member and interface functions.
struct PGPROC
{
...
int syscache_usage_track_interval; /* track interval, 0 to disable */
=# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]);
=# select syscache_usage_track_remove(2134);
Or, just provide a one-shot triggering function.
=# select syscache_take_usage_track(<pid>);
This can use either a similar PGPROC variable or SendProcSignal(),
but the former doesn't fire during idle time unless a timer is used.
The attached is the revised version of this patchset, where the third
patch is the remote setting feature. It uses static shared memory.
=# select pg_backend_catcache_stats(<pid>, <millis>);
Activates or changes catcache stats feature on the backend with
PID. (The name should be changed to .._syscache_stats, though.)
It is far smaller than the remote-GUC feature. (It contains a
part that should be in the previous patch. I will fix it later.)
I have a few questions to make sure we have not made the API too
complex. First, for syscache_prune_min_age, that is the minimum age
that we prune, and entries could last twice that long. Is there any
value to doing the scan at 50% of the age so that the
syscache_prune_min_age is the max age? For example, if our age cutoff
is 10 minutes, we could scan every 5 minutes so 10 minutes would be the
maximum age kept.
Second, when would you use syscache_memory_target != 0? If you had
syscache_prune_min_age really fast, e.g. 10 seconds? What is the
use-case for this? You have a query that touches 10k objects, and then
the connection stays active but doesn't touch many of those 10k objects,
and you want it cleaned up in seconds instead of minutes? (I can't see
why you would not clean up all unreferenced objects after _minutes_ of
disuse, but removing them after seconds of disuse seems undesirable.)
What are the odds you would retain the entries you want with a fast
target?
What is the value of being able to change a specific backend's stat
interval? I don't remember any other setting having this ability.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Thank you for the comments.
At Wed, 23 Jan 2019 18:21:45 -0500, Bruce Momjian <bruce@momjian.us> wrote in <20190123232145.GA8334@momjian.us>
On Wed, Jan 23, 2019 at 05:35:02PM +0900, Kyotaro HORIGUCHI wrote:
At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp>
An option is an additional PGPROC member and interface functions.
struct PGPROC
{
...
int syscache_usage_track_interval; /* track interval, 0 to disable */
=# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]);
=# select syscache_usage_track_remove(2134);
Or, just provide a one-shot triggering function.
=# select syscache_take_usage_track(<pid>);
This can use either a similar PGPROC variable or SendProcSignal(),
but the former doesn't fire during idle time unless a timer is used.
The attached is the revised version of this patchset, where the third
patch is the remote setting feature. It uses static shared memory.
=# select pg_backend_catcache_stats(<pid>, <millis>);
Activates or changes catcache stats feature on the backend with
PID. (The name should be changed to .._syscache_stats, though.)
It is far smaller than the remote-GUC feature. (It contains a
part that should be in the previous patch. I will fix it later.)
I have a few questions to make sure we have not made the API too
complex. First, for syscache_prune_min_age, that is the minimum age
that we prune, and entries could last twice that long. Is there any
value to doing the scan at 50% of the age so that the
syscache_prune_min_age is the max age? For example, if our age cutoff
is 10 minutes, we could scan every 5 minutes so 10 minutes would be the
maximum age kept.
(Looking into the patch..) Actually thrice, not twice. That is
because I put significance on access frequency. I think it is
reasonable that entries with more frequent access get a longer
life (within a certain limit). The original problem here was
negative caches that are created but never accessed. However,
there's no firm reason for the number of steps (3). There
might be no difference if the extra lifetime were at most one
multiple of s_p_m_age, or even if there were no extra time.
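A sketch of that stepped policy (hypothetical names; the step count of 3 follows the description above):

#include <stdbool.h>

#define MAX_AGE_STEPS 3

/* naccess: accesses so far; idle_secs: time since last access */
static bool
entry_expired(int naccess, long idle_secs, long prune_min_age)
{
    /* frequently accessed entries survive up to 3x the minimum age */
    int steps = (naccess < MAX_AGE_STEPS) ? naccess + 1 : MAX_AGE_STEPS;

    return idle_secs > prune_min_age * steps;
}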
Second, when would you use syscache_memory_target != 0?
It came from a suggestion upthread: we sometimes want to keep some
known amount of cache entries even when expiration is activated.
If you had
syscache_prune_min_age really fast, e.g. 10 seconds? What is the
use-case for this? You have a query that touches 10k objects, and then
the connection stays active but doesn't touch many of those 10k objects,
and you want it cleaned up in seconds instead of minutes? (I can't see
why you would not clean up all unreferenced objects after _minutes_ of
disuse, but removing them after seconds of disuse seems undesirable.)
What are the odds you would retain the entries you want with a fast
target?
Are you asking the reason for the unit? It's just that the value
won't be so large even in seconds, at most 3600. Even
though I don't think such a short duration setting is meaningful
in the real world, I also don't think we need to prohibit
it. (Actually it is useful for testing :p) Another reason is
that GUC_UNIT_MIN doesn't seem so common; it is used by only
two variables, log_rotation_age and old_snapshot_threshold.
What is the value of being able to change a specific backend's stat
interval? I don't remember any other setting having this ability.
As mentioned upthread, it takes significant time to take the
statistics, so I believe no one is willing to turn it on at all
times. As a result it would be useless, because it cannot be
turned on for an already-active backend when that backend actually
gets bloated. So I wanted to provide a remote switching feature.
I also thought that there are some other features that would be
useful if they could be turned on remotely, hence the remote GUC
feature, but it was too complex...
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:
Second, when would you use syscache_memory_target != 0?
It came from a suggestion upthread: we sometimes want to keep some known
amount of cache entries even when expiration is activated.
If you had
syscache_prune_min_age really fast, e.g. 10 seconds? What is the
use-case for this? You have a query that touches 10k objects, and then
the connection stays active but doesn't touch many of those 10k objects,
and you want it cleaned up in seconds instead of minutes? (I can't see
why you would not clean up all unreferenced objects after _minutes_ of
disuse, but removing them after seconds of disuse seems undesirable.)
What are the odds you would retain the entries you want with a fast
target?
Are you asking the reason for the unit? It's just that the value
won't be so large even in seconds, at most 3600. Even
though I don't think such a short duration setting is meaningful
in the real world, I also don't think we need to prohibit
it. (Actually it is useful for testing :p) Another reason is
We have gone from ignoring the cache bloat problem to designing an API
whose value even we don't know, and if we don't know,
we can be sure our users will not know. Every GUC has a cost, even if
it is not used.
I suggest you go with just syscache_prune_min_age, get that into PG 12,
and we can then reevaluate what we need. If you want to hard-code a
minimum cache size where no pruning will happen, maybe based on the system
catalogs or typical load, that is fine.
that GUC_UNIT_MIN doesn't seem so common; it is used by only
two variables, log_rotation_age and old_snapshot_threshold.
What is the value of being able to change a specific backend's stat
interval? I don't remember any other setting having this ability.
As mentioned upthread, it takes significant time to take the
statistics, so I believe no one is willing to turn it on at all
times. As a result it would be useless, because it cannot be
turned on for an already-active backend when that backend actually
gets bloated. So I wanted to provide a remote switching feature.
I also thought that there are some other features that would be
useful if they could be turned on remotely, hence the remote GUC
feature, but it was too complex...
Well, I am thinking if we want to do something like this, we should do
it for all GUCs, not just for this one, so I suggest we not do this now
either.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce Momjian <bruce@momjian.us> writes:
On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:
I also thought that there are some other features that would be
useful if they could be turned on remotely, hence the remote GUC
feature, but it was too complex...
Well, I am thinking if we want to do something like this, we should do
it for all GUCs, not just for this one, so I suggest we not do this now
either.
I will argue hard that we should not do it at all, ever.
There is already a mechanism for broadcasting global GUC changes:
apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
I do not think we need something that can remotely change a GUC's
value in just one session. The potential for bugs, misuse, and
just plain confusion is enormous, and the advantage seems minimal.
regards, tom lane
On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:
I also thought that there are some other features that would be
useful if they could be turned on remotely, hence the remote GUC
feature, but it was too complex...
Well, I am thinking if we want to do something like this, we should do
it for all GUCs, not just for this one, so I suggest we not do this now
either.
I will argue hard that we should not do it at all, ever.
There is already a mechanism for broadcasting global GUC changes:
apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
I do not think we need something that can remotely change a GUC's
value in just one session. The potential for bugs, misuse, and
just plain confusion is enormous, and the advantage seems minimal.
I think there might be some merit in being able to activate debugging
or tracing facilities for a particular session remotely, but designing
something that will do that sort of thing well seems like a very
complex problem that certainly should not be sandwiched into another
patch that is mostly about something else. And if we ever get such a
thing I suspect it should be entirely separate from the GUC system.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Robert Haas [mailto:robertmhaas@gmail.com]
On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I will argue hard that we should not do it at all, ever.
There is already a mechanism for broadcasting global GUC changes:
apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
I do not think we need something that can remotely change a GUC's
value in just one session. The potential for bugs, misuse, and
just plain confusion is enormous, and the advantage seems minimal.
I think there might be some merit in being able to activate debugging
or tracing facilities for a particular session remotely, but designing
something that will do that sort of thing well seems like a very
complex problem that certainly should not be sandwiched into another
patch that is mostly about something else. And if we ever get such a
thing I suspect it should be entirely separate from the GUC system.
+1 for a separate patch for remote session configuration. ALTER SYSTEM + SIGHUP targeted at a particular backend would do if the DBA can log into the database server (so, it can't be used for DBaaS.) It would be useful to have pg_reload_conf(pid).
Regards
Takayuki Tsunakawa
Hi Horiguchi-san, Bruce,
From: Bruce Momjian [mailto:bruce@momjian.us]
I suggest you go with just syscache_prune_min_age, get that into PG 12,
and we can then reevaluate what we need. If you want to hard-code a
minimum cache size where no pruning will happen, maybe based on the system
catalogs or typical load, that is fine.
Please forgive me if I say something silly (I might have got lost.)
Are you suggesting to make the cache size limit system-defined and uncontrollable by the user? I think it's necessary for the DBA to be able to control the cache memory amount. Otherwise, if many concurrent connections access many partitions within a not-so-long duration, then the cache eviction can't catch up and ends up in OOM. How about the following questions I asked in my previous mail?
--------------------------------------------------
This is a pure question. How can we answer these questions from users?
* What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target = 1M?
The user tends to estimate memory to avoid OOM.
--------------------------------------------------
Regards
Takayuki Tsunakawa
On Fri, Jan 25, 2019 at 08:14:19AM +0000, Tsunakawa, Takayuki wrote:
Hi Horiguchi-san, Bruce,
From: Bruce Momjian [mailto:bruce@momjian.us]
I suggest you go with just syscache_prune_min_age, get that into
PG 12, and we can then reevaluate what we need. If you want to
hard-code a minimum cache size where no pruning will happen, maybe
based on the system catalogs or typical load, that is fine.
Please forgive me if I say something silly (I might have got lost.)
Are you suggesting to make the cache size limit system-defined and
uncontrollable by the user? I think it's necessary for the DBA to
be able to control the cache memory amount. Otherwise, if many
concurrent connections access many partitions within a not-so-long
duration, then the cache eviction can't catch up and ends up in OOM.
How about the following questions I asked in my previous mail?
----------------------------------------------------------------------
This is a pure question. How can we answer these questions from users?
* What value can I set to cache_memory_target when I can use 10 GB for
the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set
cache_memory_target = 1M?
The user tends to estimate memory to avoid OOM.
Well, let's walk through this. Suppose the default for
syscache_prune_min_age is 10 minutes, and that we prune all cache
entries unreferenced in the past 10 minutes, or we only prune every 10
minutes if the cache size is larger than some fixed size like 100.
So, when would you change syscache_prune_min_age? If you reference many
objects and then don't reference them at all for minutes, you might want
to lower syscache_prune_min_age to maybe 1 minute. Why would you want
to change the behavior of removing all unreferenced cache items, at
least when there are more than 100? (You called this
syscache_memory_target.)
My point is I can see someone wanting to change syscache_prune_min_age,
but I can't see someone wanting to change syscache_memory_target. Who
would want to keep 5k cache entries that have not been accessed in X
minutes? If we had some global resource manager that would allow you to
control work_mem, maintenance_work_mem, cache size, and set global
limits on their sizes, I can see where maybe it might make sense, but
right now the memory usage of a backend is so fluid that setting some
limit on its size for unreferenced entries just doesn't make sense.
One of my big points is that syscache_memory_target doesn't even
guarantee that the cache will be this size or lower, it only controls
whether the cleanup happens at syscache_prune_min_age intervals.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
At Fri, 25 Jan 2019 08:14:19 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB70EFB@G01JPEXMBYT05>
Hi Horiguchi-san, Bruce,
From: Bruce Momjian [mailto:bruce@momjian.us]
I suggest you go with just syscache_prune_min_age, get that into PG 12,
and we can then reevaluate what we need. If you want to hard-code a
minimum cache size where no pruning will happen, maybe based on the system
catalogs or typical load, that is fine.
Please forgive me if I say something silly (I might have got lost.)
Are you suggesting to make the cache size limit system-defined and uncontrollable by the user? I think it's necessary for the DBA to be able to control the cache memory amount. Otherwise, if many concurrent connections access many partitions within a not-so-long duration, then the cache eviction can't catch up and ends up in OOM. How about the following questions I asked in my previous mail?
cache_memory_target does the opposite of limiting memory usage. It
keeps some amount of syscache entries unpruned. It is intended for
sessions where cache-effective queries run intermittently.
syscache_prune_min_age also doesn't directly limit the size. It
just eventually prevents infinite memory consumption.
The knobs are not no-brainers at all, but they don't need tuning in
most cases.
--------------------------------------------------
This is a pure question. How can we answer these questions from users?
* What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target = 1M?
The user tends to estimate memory to avoid OOM.
--------------------------------------------------
You don't have direct control over syscache memory usage. When
you find a query slowed by the default cache expiration, you can
set cache_memory_target to keep entries for intermittent executions
of the query, or you can increase syscache_prune_min_age to let
cache entries live longer.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 25 Jan 2019 07:26:46 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB70E6B@G01JPEXMBYT05>
From: Robert Haas [mailto:robertmhaas@gmail.com]
On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I will argue hard that we should not do it at all, ever.
There is already a mechanism for broadcasting global GUC changes:
apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
I do not think we need something that can remotely change a GUC's
value in just one session. The potential for bugs, misuse, and
just plain confusion is enormous, and the advantage seems minimal.
I think there might be some merit in being able to activate debugging
or tracing facilities for a particular session remotely, but designing
something that will do that sort of thing well seems like a very
complex problem that certainly should not be sandwiched into another
patch that is mostly about something else. And if we ever get such a
thing I suspect it should be entirely separate from the GUC system.
That means we would have a lesser copy of the GUC system that can be
set remotely; some features would explicitly register their own
knobs on the new system, presumably with the same names as the
related GUCs (for users' convenience).
+1 for a separate patch for remote session configuration.
That sounds reasonable to me. As I said, there should be some such
variables.
ALTER SYSTEM + SIGHUP targeted at a particular backend would do
if the DBA can log into the database server (so, it can't be
used for DBaaS.) It would be useful to have
pg_reload_conf(pid).
I don't think that is reasonable. ALTER SYSTEM alters a *system*
configuration, which is assumed to be the same for all sessions and
other processes. All sessions would start the syscache tracking if
an ALTER SYSTEM for another variable followed by pg_reload_conf()
came after doing the above. I think the change should persist no
longer than the session lifetime.
I think a consensus on backend-targeted remote tuning has been
reached here :)
A. Let GUC variables be settable by a remote session.
A-1. Variables are changed at a busy time (my first patch).
(transaction-awareness of GUC makes this complex)
A-2. Variables are changed when the session is idle (or outside
a transaction).
B. Override some variables via values placed in shared memory (my
second or last patch).
Very specific to a target feature. I think it consumes a bit
too much memory.
C. Provide session-specific GUC variables (that override the global ones)
- Add a new configuration file "postgresql.conf.<PID>";
pg_reload_conf() lets the session with that PID load it as if
it were the last included file. All such files are removed at
startup or at the end of the corresponding session.
- Add a new syntax like this:
ALTER SESSION WITH (pid=xxxx)
SET configuration_parameter {TO | =} {value | 'value' | DEFAULT}
RESET configuration_parameter
RESET ALL
- Target variables are marked with GUC_REMOTE.
I'll consider the last choice and will come up with a patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi
I suggest you go with just syscache_prune_min_age, get that into PG
12, and we can then reevaluate what we need. If you want to
hard-code a minimum cache size where no pruning will happen, maybe
based on the system catalogs or typical load, that is fine.
Please forgive me if I say something silly (I might have got lost.)
Are you suggesting to make the cache size limit system-defined and uncontrollable
by the user? I think it's necessary for the DBA to be able to control the cache memory
amount. Otherwise, if many concurrent connections access many partitions within a
not-so-long duration, then the cache eviction can't catch up and ends up in OOM.
How about the following questions I asked in my previous mail?
cache_memory_target does the opposite of limiting memory usage. It keeps some
amount of syscache entries unpruned. It is intended for sessions where
cache-effective queries run intermittently.
syscache_prune_min_age also doesn't directly limit the size. It just eventually
prevents infinite memory consumption.
The knobs are not no-brainers at all, but they don't need tuning in most cases.
--------------------------------------------------
This is a pure question. How can we answer these questions from users?
* What value can I set to cache_memory_target when I can use 10 GB for the
caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target
= 1M?
The user tends to estimate memory to avoid OOM.
--------------------------------------------------
You don't have direct control over syscache memory usage. When you find a query
slowed by the default cache expiration, you can set cache_memory_target to keep
entries for intermittent execution of a query, or you can increase
syscache_prune_min_age to let cache entries live longer.
In the current v8 patch there is a stats view representing the age class distribution.
/messages/by-id/20181019.173457.68080786.horiguchi.kyotaro@lab.ntt.co.jp
Does it help the DBA with tuning cache_prune_age and/or cache_prune_target?
If the number of cache entries in the older age classes is large, are people supposed to lower prune_age and
leave cache_prune_target unchanged?
(I'm a little confused.)
Regards,
Takeshi Ideriha
At Wed, 30 Jan 2019 05:06:30 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F4156D4@G01JPEXMBKW04>
You don't have direct control over syscache memory usage. When you find a query
slowed by the default cache expiration, you can set cache_memory_target to keep
entries for intermittent execution of a query, or you can increase
syscache_prune_min_age to let cache entries live longer.
In the current v8 patch there is a stats view representing the age class distribution.
/messages/by-id/20181019.173457.68080786.horiguchi.kyotaro@lab.ntt.co.jp
Does it help the DBA with tuning cache_prune_age and/or cache_prune_target?
Definitely. Right now the DBA can see nothing at all about cache usage.
If the number of cache entries in the older age classes is large, are people supposed to lower prune_age and
leave cache_prune_target unchanged?
(I'm a little confused.)
This feature just removes cache entries that have not been accessed
for a certain time.
If older entries occupy the major portion, it means that the
syscache is used effectively (in other words, most of the entries
are accessed frequently enough). In that case I believe the
syscache doesn't put pressure on memory usage. If the total
memory usage exceeds expectations in that case, reducing the pruning
age may reduce it, but not necessarily. An extremely short pruning
age will work, in exchange for performance degradation.
If newer entries occupy the major portion, it means that the
syscache may not be used effectively. The total amount of memory
usage will be limited by the pruning feature, so tuning won't be
needed.
In both cases, if pruning causes slowdowns of intermittent large
queries, cache_memory_target will alleviate the slowdown.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Jan 28, 2019 at 01:31:43PM +0900, Kyotaro HORIGUCHI wrote:
I'll consider the last choice and will come up with a patch.
Update is recent, so I have just moved the patch to next CF.
--
Michael
Horiguchi-san, Bruce,
Thank you for telling me the ideas behind this feature. Frankly, I'm not yet convinced that the proposed specification is OK, but I can't explain why well at this moment. So, let me discuss that in a subsequent mail.
Anyway, here are my review comments on 0001:
(1)
+/* GUC variable to define the minimum age of entries that will be cosidered to
+ /* initilize catcache reference clock if haven't done yet */
cosidered -> considered
initilize -> initialize
I remember I saw some other wrong spelling and/or missing words, which I forgot (sorry).
(2)
Only the doc prefixes "sys" to the new parameter names. Other places don't have it. I think we should prefix sys, because relcache and plancache should be configurable separately because of their different usage patterns/lifecycle.
(3)
The doc doesn't describe the unit of syscache_memory_target. Kilobytes?
(4)
+ hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+ tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
+ tupsize = sizeof(CatCTup);
GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do. (See the sketch after this review list.)
(5)
+ if (entry_age > cache_prune_min_age)
">=" instead of ">"?
(6)
+ if (!ct->c_list || ct->c_list->refcount == 0)
+ {
+ CatCacheRemoveCTup(cp, ct);
It's better to write "ct->c_list == NULL" to follow the style in this file.
"ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been released for a long time, which might hardly happen.
(7)
CatalogCacheCreateEntry
+ int tupsize = 0;
if (ntp)
{
int i;
+ int tupsize;
tupsize is defined twice.
(8)
CatalogCacheCreateEntry
In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted. I'm afraid that's not negligible.
(9)
The memory for CatCList is not taken into account for syscache_memory_target.
Regards
Takayuki Tsunakawa
Horiguchi-san, Bruce, all,
I hesitate to say this, but I think there are the following problems with the proposed approach:
1) It tries to prune the catalog tuples only when the hash table is about to expand.
If no tuple is found to be eligible for eviction at first and the hash table expands, it becomes difficult for unnecessary or less frequently accessed tuples to be removed, because the interval until the next hash table expansion grows longer and longer. The hash table doubles in size each time.
For example, if many transactions are executed in a short duration that create and drop temporary tables and indexes, the hash table could become large quickly.
2) syscache_prune_min_age is difficult to set to meet contradictory requirements.
e.g., in the above temporary objects case, the user wants to shorten syscache_prune_min_age so that the catalog tuples for temporary objects are removed. But that also is likely to result in the necessary catalog tuples for non-temporary objects being removed.
3) The DBA cannot control the memory usage. It's not predictable.
syscache_memory_target doesn't set the limit on memory usage despite the impression from its name. In general, the cache should be able to set the upper limit on its size so that the DBA can manage things within a given amount of memory. I think other PostgreSQL parameters are based on that idea -- shared_buffers, wal_buffers, work_mem, temp_buffers, etc.
4) The memory usage doesn't decrease once allocated.
The normal allocation memory context, aset.c, which CacheMemoryContext uses, doesn't return pfree()d memory to the operating system. Once CacheMemoryContext becomes big, it won't get smaller.
5) Catcaches are managed independently of each other.
Even if there are many unnecessary catalog tuples in one catcache, they are not freed to make room for other catcaches.
So, why don't we make syscache_memory_target the upper limit on the total size of all catcaches, and rethink the past LRU management?
Regards
Takayuki Tsunakawa
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
Horiguchi-san, Bruce, all,
So, why don't we make syscache_memory_target the upper limit on the
total size of all catcaches, and rethink the past LRU management?
I was going to say that our experience with LRU has been that the
overhead is not worth the value, but that was in shared resource cases,
which this is not.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
From: bruce@momjian.us [mailto:bruce@momjian.us]
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
Horiguchi-san, Bruce, all,
So, why don't we make syscache_memory_target the upper limit on the
total size of all catcaches, and rethink the past LRU management?
I was going to say that our experience with LRU has been that the
overhead is not worth the value, but that was in shared resource cases,
which this is not.
That's good news! Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san.
Regards
Takayuki Tsunakawa
From: bruce@momjian.us [mailto:bruce@momjian.us]
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
Horiguchi-san, Bruce, all,
So, why don't we make syscache_memory_target the upper limit on the
total size of all catcaches, and rethink the past LRU management?
I was going to say that our experience with LRU has been that the overhead is not
worth the value, but that was in shared resource cases, which this is not.
One idea is building a list with an access counter, implementing an LRU list on top of the current patch.
The list is ordered by last access time. When a catcache entry is referenced, the list is maintained,
which is just a few pointer manipulations.
As Bruce mentioned, it's not shared, so there is no cost related to lock contention.
When it comes to pruning, entries older than a certain timestamp with a zero access counter are pruned.
This would improve performance because it only scans a limited range (bounded by sys_cache_min_age).
The current patch scans all hash entries and checks each timestamp, which would decrease performance as the cache size grows.
I'm hoping to implement this idea and measure the performance.
And when we want to set the memory size limit as Tsunakawa san said, the LRU list would be suitable.
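A sketch of this LRU idea against PostgreSQL's lib/ilist.h (the entry layout and function names are made up; timestamps are microseconds, as with TimestampTz):

#include "postgres.h"
#include "datatype/timestamp.h"
#include "lib/ilist.h"

typedef struct LruEntry
{
    dlist_node  lru_node;     /* the two extra pointers per entry */
    TimestampTz lastaccess;
    int         naccess;
} LruEntry;

static dlist_head lru_list = DLIST_STATIC_INIT(lru_list);

/* on every cache hit: O(1) pointer swaps, no scan */
static void
lru_touch(LruEntry *entry, TimestampTz now)
{
    entry->lastaccess = now;
    entry->naccess++;
    dlist_move_head(&lru_list, &entry->lru_node);
}

/* pruning: walk only the cold tail, stop at the first young entry */
static void
lru_prune(TimestampTz now, int64 min_age_usec)
{
    while (!dlist_is_empty(&lru_list))
    {
        LruEntry *tail = dlist_tail_element(LruEntry, lru_node, &lru_list);

        if (now - tail->lastaccess <= min_age_usec)
            break;            /* everything nearer the head is newer */
        dlist_delete(&tail->lru_node);
        /* ... also remove the entry from the hash and pfree() it ... */
    }
}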
Regards,
Takeshi Ideriha
Hi,
I find it a bit surprising there are almost no results demonstrating the
impact of the proposed changes on some typical workloads. It touches
code (syscache, ...) that is quite sensitive performance-wise, and
adding even just a little bit of overhead may hurt significantly. Even
on systems that don't have issues with cache bloat, etc.
I think this is something we need - benchmarks measuring the overhead on
a bunch of workloads (both typical and corner cases). Especially when
there was a limit on cache size in the past, and it was removed because
it was too expensive / hurting in some cases. I can't imagine committing
any such changes without this information.
This is particularly important as the patch was about one particular
issue (bloat due to negative entries) initially, but then the scope grew
quite a bit. AFAICS the thread now talks about these workloads:
* negative entries (due to search_path lookups etc.)
* many tables accessed randomly
* many tables with only a small subset accessed frequently
* many tables with subsets accessed in subsets (due to pooling)
* ...
Unfortunately, some of those cases seem somewhat contradictory (i.e.
what works for one hurts the other), so I doubt it's possible to improve
all of them at once. But that makes the benchmarking even more important.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/21/19 9:56 PM, Bruce Momjian wrote:
On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote:
Hi,
On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
My proposal for this was to attach a 'generation' to cache entries. Upon
access cache entries are marked to be of the current
generation. Whenever existing memory isn't sufficient for further cache
entries and, on a less frequent schedule, triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured. Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generation are
removed from the caches using a sequential scan through the cache.
This outline achieves:
- no additional time measurements in hot code paths
- no need for a sequential scan of the entire cache when no generations
are too old
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
Seems generally reasonable. The "whenever existing memory isn't
sufficient for further cache entries" part I'm not sure about.
Couldn't that trigger very frequently and prevent necessary cache size
growth?
I'm thinking it'd just trigger a new generation, with its associated
"creation" time (which is cheap to acquire in comparison to creating a
number of cache entries) . Depending on settings or just code policy we
can decide up to which generation to prune the cache, using that
creation time. I'd imagine that we'd have some default cache-pruning
time in the minutes, and for workloads where relevant one can make
sizing configurations more aggressive - or something like that.
OK, so it seems everyone likes the idea of a timer. The open questions
are whether we want multiple epochs, and whether we want some kind of
size trigger.
FWIW I share the view that time-based eviction (be it some sort of
timestamp or epoch) seems promising; it seems cheaper than pretty much any
other LRU metric (requiring usage count / clock sweep / ...).
With only one time epoch, if the timer is 10 minutes, you could expire an
entry after 10-19 minutes, while with a new epoch every minute and
10-minute expire, you can do 10-11 minute precision. I am not sure the
complexity is worth it.
I don't think having just a single epoch would be significantly less
complex than having more of them. In fact, having more of them might
make it actually cheaper.
For a size trigger, should removal be affected by how many expired cache
entries there are? If there were 10k expired entries or 50, wouldn't
you want them removed if they have not been accessed in X minutes?
In the worst case, if 10k entries were accessed in a query and never
accessed again, what would the ideal cleanup behavior be? Would it
matter if it was expired in 10 or 19 minutes? Would it matter if there
were only 50 entries?
I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.
FWIW when it comes to memory consumption, it's important to realize the
cache memory context won't release the memory to the system, even if we
remove the expired entries. It'll simply stash them into a freelist.
That's OK when the entries are to be reused, but the memory usage won't
decrease after a sudden spike for example (and there may be other chunks
allocated on the same page, so paging it out will hurt).
So if we want to address this case too (and we probably want), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.
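A sketch of that rebuild idea (copy_live_entries and the context variable are hypothetical; AllocSetContextCreate and MemoryContextDelete are the existing APIs):

#include "postgres.h"
#include "utils/memutils.h"

static MemoryContext my_cache_cxt;   /* assumed: the cache's own context */

/* hypothetical: re-palloc all surviving entries into the new context */
extern void copy_live_entries(MemoryContext newcxt);

static void
rebuild_cache_context(void)
{
    MemoryContext newcxt = AllocSetContextCreate(CacheMemoryContext,
                                                 "rebuilt cache",
                                                 ALLOC_SET_DEFAULT_SIZES);

    copy_live_entries(newcxt);
    MemoryContextDelete(my_cache_cxt);   /* releases the old blocks wholesale */
    my_cache_cxt = newcxt;
}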
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Feb-05, Tomas Vondra wrote:
I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.
I think seqscanning the hash table is going to be too slow; Ideriha-san's
idea of having a dlist with the entries in LRU order (where each entry
is moved to head of list when it is touched) seemed good: it allows you
to evict older ones when the time comes, without having to scan the rest
of the entries. Having a dlist means two more pointers on each cache
entry AFAIR, so it's not a huge amount of memory.
So if we want to address this case too (and we probably want), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.
Yeah, we probably don't want to do this super frequently though.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/5/19 11:05 PM, Alvaro Herrera wrote:
On 2019-Feb-05, Tomas Vondra wrote:
I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.
I think seqscanning the hash table is going to be too slow; Ideriha-san's
idea of having a dlist with the entries in LRU order (where each entry
is moved to head of list when it is touched) seemed good: it allows you
to evict older ones when the time comes, without having to scan the rest
of the entries. Having a dlist means two more pointers on each cache
entry AFAIR, so it's not a huge amount of memory.
Possibly, although my guess is it will depend on the number of entries
to remove. For small number of entries, the dlist approach is going to
be faster, but at some point the bulk seqscan gets more efficient.
FWIW this is exactly where a bit of benchmarking would help.
So if we want to address this case too (and we probably want), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.

Yeah, we probably don't want to do this super frequently though.
Right. I've also realized the resizing is built into dynahash and is
kinda incremental - we add (and split) buckets one by one, instead of
immediately rebuilding the whole hash table. So yes, this would need
more care and might need to interact with dynahash in some way.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 5 Feb 2019 02:40:35 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05>
From: bruce@momjian.us [mailto:bruce@momjian.us]
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
Horiguchi-san, Bruce, all, So, why don't we make
syscache_memory_target the upper limit on the total size of all
catcaches, and rethink the past LRU management?

I was going to say that our experience with LRU has been that the
overhead is not worth the value, but that was in shared resource cases,
which this is not.

That's good news! Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san.
If by "LRU" you mean an access-time-ordered list of entries, I
still object to involving it, since it adds too much complexity to
the search code paths. Invalidation would make things more complex. The
current patch sorts entries by ct->lastaccess and discards
entries not accessed for more than the threshold, only when doubling
the cache capacity. It is already a kind of LRU in behavior.

This patch intends not to let caches bloat with unnecessary
entries, which are negative ones at first, then less-accessed ones
currently. If by "LRU" you mean something that puts a hard limit on
the number or size of a catcache or all caches, it would be
doable by adding a sort phase before pruning, like
CatCacheCleanOldEntriesByNum() in the attached PoC (first
attachment), as food for discussion.
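As a rough illustration of such a sort-then-prune pass (not the actual PoC code; the candidate array and helper names are made up):

    #include "postgres.h"
    #include "utils/timestamp.h"

    typedef struct PruneCand
    {
        TimestampTz lastaccess;
        void       *entry;          /* a CatCTup * in the real cache */
    } PruneCand;

    static int
    prunecand_cmp(const void *a, const void *b)
    {
        TimestampTz ta = ((const PruneCand *) a)->lastaccess;
        TimestampTz tb = ((const PruneCand *) b)->lastaccess;

        return (ta < tb) ? -1 : (ta > tb) ? 1 : 0;
    }

    /* keep at most 'limit' entries, dropping the least recently used */
    static void
    prune_to_limit(PruneCand *cands, int ncands, int limit)
    {
        qsort(cands, ncands, sizeof(PruneCand), prunecand_cmp);

        /* the oldest entries sort first */
        for (int i = 0; i < ncands - limit; i++)
        {
            /* ... remove cands[i].entry from the catcache ... */
        }
    }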
With the second attached script, we can observe what is happening
from another session by the following query.
select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass;
pg_statistic | 1041024 | 7109 | {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0
On the other hand, unlike the original pruning, this
happens independently of hash resizing, so it will cause another
observable intermittent slowdown besides rehashing.
The two should have the same extent of impact on performance when
disabled. I'll take numbers briefly using pgbench.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Wed, 06 Feb 2019 14:43:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.144334.193118280.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 5 Feb 2019 02:40:35 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05>
From: bruce@momjian.us [mailto:bruce@momjian.us]
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
Horiguchi-san, Bruce, all, So, why don't we make
syscache_memory_target the upper limit on the total size of all
catcaches, and rethink the past LRU management?

I was going to say that our experience with LRU has been that the
overhead is not worth the value, but that was in shared resource cases,
which this is not.

That's good news! Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san.
If by "LRU" you mean an access-time-ordered list of entries, I
still object to involving it, since it adds too much complexity to
the search code paths. Invalidation would make things more complex. The
current patch sorts entries by ct->lastaccess and discards
entries not accessed for more than the threshold, only when doubling
the cache capacity. It is already a kind of LRU in behavior.

This patch intends not to let caches bloat with unnecessary
entries, which are negative ones at first, then less-accessed ones
currently. If by "LRU" you mean something that puts a hard limit on
the number or size of a catcache or all caches, it would be
doable by adding a sort phase before pruning, like
CatCacheCleanOldEntriesByNum() in the attached PoC (first
attachment), as food for discussion.

With the second attached script, we can observe what is happening
from another session by the following query.

select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass;
pg_statistic | 1041024 | 7109 | {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0
On the other hand, unlike the original pruning, this
happens independently of hash resizing, so it will cause another
observable intermittent slowdown besides rehashing.

The two should have the same extent of impact on performance when
disabled. I'll take numbers briefly using pgbench.
Sorry, I forgot to consider references in the previous patch, and
also to attach the test script.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Wed, 06 Feb 2019 15:16:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.151653.117382256.horiguchi.kyotaro@lab.ntt.co.jp>
The two should have the same extent of impact on performance when
disabled. I'll take numbers briefly using pgbench.
(pgbench -j 10 -c 10 -T 120) x 5 times for each.
A: unpatched : 118.58 tps (stddev 0.44)
B: patched-not-used[1] : 118.41 tps (stddev 0.29)
C: patched-timedprune[2]: 118.41 tps (stddev 0.51)
D: patched-capped...... : none[3]

[1]: cache_prune_min_age = 0, cache_entry_limit = 0
[2]: cache_prune_min_age = 100, cache_entry_limit = 0
(Prunes every 100ms)
[3]: I didn't find a sane benchmark for the capping case using
vanilla pgbench.
It doesn't seem to show significant degradation on *my*
box...

# I found a bug that can remove a newly created entry. So, v11.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2019-02-06 17:37:04 +0900, Kyotaro HORIGUCHI wrote:
At Wed, 06 Feb 2019 15:16:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.151653.117382256.horiguchi.kyotaro@lab.ntt.co.jp>
The two should have the same extent of impact on performance when
disabled. I'll take numbers briefly using pgbench.

(pgbench -j 10 -c 10 -T 120) x 5 times for each.

A: unpatched : 118.58 tps (stddev 0.44)
B: patched-not-used[1] : 118.41 tps (stddev 0.29)
C: patched-timedprune[2]: 118.41 tps (stddev 0.51)
D: patched-capped...... : none[3]

[1]: cache_prune_min_age = 0, cache_entry_limit = 0
[2]: cache_prune_min_age = 100, cache_entry_limit = 0
(Prunes every 100ms)
[3]: I didn't find a sane benchmark for the capping case using
vanilla pgbench.

It doesn't seem to show significant degradation on *my*
box...

# I found a bug that can remove a newly created entry. So, v11.
This seems to just benchmark your disk speed, no? ISTM you need to
measure readonly performance, not read/write. And with plenty more
tables than just standard pgbench -S.
Greetings,
Andres Freund
At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190205220526.GA1442@alvherre.pgsql>
On 2019-Feb-05, Tomas Vondra wrote:
I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.

I think seqscanning the hash table is going to be too slow; Ideriha-san's
idea of having a dlist with the entries in LRU order (where each entry
is moved to head of list when it is touched) seemed good: it allows you
to evict older ones when the time comes, without having to scan the rest
of the entries. Having a dlist means two more pointers on each cache
entry AFAIR, so it's not a huge amount of memory.
Ah, I had a separate list in mind. It sounds reasonable to have
the pointers in the cache entry. But I'm not sure how much impact the
additional dlist_* calls would have.
The attached is the new version with the following properties:
- Both prune-by-age and hard limiting feature.
(Merged into single function, single scan)
Debug tracking feature in CatCacheCleanupOldEntries is removed
since it no longer runs a full scan.
Prune-by-age can be a single-setting-for-all-caches feature, but
the hard limit is obviously not. We could use reloptions for
the purpose (which is not currently available on pg_class and
pg_attribute:p). I'll add that if there's no strong objection.
Or has anyone come up with something suitable for the
purpose?
- Using LRU to get rid of the full scan.
I added a new API, dlist_move_to_tail, which was needed to construct the LRU.
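For reference, such an API could simply mirror the existing dlist_move_head() in lib/ilist.h; a minimal sketch:

    #include "lib/ilist.h"

    /* Move 'node' to the tail of 'head'; O(1), like dlist_move_head(). */
    static inline void
    dlist_move_tail(dlist_head *head, dlist_node *node)
    {
        /* fast path if it already is at the tail */
        if (head->head.prev == node)
            return;

        dlist_delete(node);
        dlist_push_tail(head, node);
    }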
I'm going to retake numbers with search-only queries.
So if we want to address this case too (and we probably want), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.

Yeah, we probably don't want to do this super frequently though.
MemoryContext per cache?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi, thanks for recent rapid work.
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote in <20190205220526.GA1442@alvherre.pgsql>

On 2019-Feb-05, Tomas Vondra wrote:
I don't think we need to remove the expired entries right away, if
there are only very few of them. The cleanup requires walking the
hash table, which means significant fixed cost. So if there are only
few expired entries (say, less than 25% of the cache), we can just
leave them around and clean them if we happen to stumble on them
(although that may not be possible with dynahash, which has no
concept of expiration) or before enlarging the hash table.

I think seqscanning the hash table is going to be too slow;
Ideriha-san idea of having a dlist with the entries in LRU order
(where each entry is moved to head of list when it is touched) seemed
good: it allows you to evict older ones when the time comes, without
having to scan the rest of the entries. Having a dlist means two more
pointers on each cache entry AFAIR, so it's not a huge amount of memory.

Ah, I had a separate list in mind. It sounds reasonable to have the pointers in the cache entry.
But I'm not sure how much additional
dlist_* impact.
Thank you for picking up my comment, Alvaro.
That's what I was thinking about.
The attached is the new version with the following properties:
- Both prune-by-age and hard limiting feature.
(Merged into single function, single scan)
Debug tracking feature in CatCacheCleanupOldEntries is removed
since it no longer runs a full scan.
It seems to me that adding a hard-limit strategy besides the prune-by-age one is good
for handling the variety of (contradictory) cases that have been discussed in this thread. I need a hard limit as well.
The hard limit is currently expressed as a number of cache entries,
controlled by both cache_entry_limit and cache_entry_limit_prune_ratio.
Why don't we change it to an amount of memory (bytes)?
An amount of memory is a more direct parameter for a customer who wants to
set the hard limit, and is easier to tune than a number of cache entries.
- Using LRU to get rid of the full scan.
I added a new API, dlist_move_to_tail, which was needed to construct the LRU.
I had just thought that dlist_move_head() already exists, so new entries
could go at the head and old ones at the tail. But that's no objection to adding the
new API, because depending on the situation, putting new entries at the head can make for more readable code,
and vice versa.
Regards,
Takeshi Ideriha
At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
I'm going to retake numbers with search-only queries.
Yeah, I was stupid.
I made a rerun of benchmark using "-S -T 30" on the server build
with no assertion and -O2. The numbers are the best of three
successive attempts. The patched version is running with
cache_target_memory = 0, cache_prune_min_age = 600 and
cache_entry_limit = 0 but pruning doesn't happen by the workload.
master: 13393 tps
v12 : 12625 tps (-6%)
Significant degradation is found.
Reducing the frequency of dlist_move_tail, by enforcing a 1ms interval
between two successive updates on the same entry, made the
degradation disappear.

patched : 13720 tps (+2%)

I think there's still no need for such frequency. It is 100ms in
the attached patch.
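A minimal sketch of that throttling, under the illustrative names below (the real patch uses a cached clock rather than calling GetCurrentTimestamp() on every access):

    #include "postgres.h"
    #include "lib/ilist.h"
    #include "utils/timestamp.h"

    #define MIN_LRU_UPDATE_INTERVAL 100     /* ms, per the patch's 100ms */

    typedef struct LruEntry
    {
        dlist_node  lru_node;
        TimestampTz lastaccess;
    } LruEntry;

    /*
     * Touch an entry on a cache hit, but only pay for the dlist relink
     * when the previous update is older than the interval.  Newest
     * entries live at the tail, matching dlist_move_to_tail in the patch.
     */
    static inline void
    lru_touch_throttled(dlist_head *lru, LruEntry *e, TimestampTz now)
    {
        if (TimestampDifferenceExceeds(e->lastaccess, now,
                                       MIN_LRU_UPDATE_INTERVAL))
        {
            dlist_delete(&e->lru_node);
            dlist_push_tail(lru, &e->lru_node);
        }
        e->lastaccess = now;
    }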
# I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..
The attached is the updated patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
I made a rerun of benchmark using "-S -T 30" on the server build with no assertion and
-O2. The numbers are the best of three successive attempts. The patched version is
running with cache_target_memory = 0, cache_prune_min_age = 600 and
cache_entry_limit = 0 but pruning doesn't happen by the workload.

master: 13393 tps
v12 : 12625 tps (-6%)

Significant degradation is found.

Reducing the frequency of dlist_move_tail, by enforcing a 1ms interval between two
successive updates on the same entry, made the degradation disappear.

patched : 13720 tps (+2%)
It would be good to introduce some interval.
I followed your benchmark (initialized with scale factor = 10; other options the same)
and found the same tendency.

These are averages of 5 trials.
master: 7640.000538
patch_v12: 7417.981378 (3% down from master)
patch_v13: 7645.071787 (almost the same as master)
These cases are workloads where pruning does not happen, as you mentioned.
I'd like to benchmark the cache-pruning case as well.
To demonstrate the cache-pruning case,
right now I'm creating hundreds of partitioned tables and running a SELECT query against each of them
using a pgbench custom file. Maybe using a small cache_prune_min_age or a hard limit would be better.
Is there any good model?
# I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..
How about MIN_LRU_UPDATE_INTERVAL?
Regards,
Takeshi Ideriha
From: Tomas Vondra
I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.
I agree that we don't need to evict cache entries as long as the
memory permits (within the control of the DBA).
But how does the concept of expiration fit the catcache? How would
the user determine the expiration time, i.e. setting of
syscache_prune_min_age? If you set a small value to evict unnecessary
entries faster, necessary entries will also be evicted. Some access
counter would keep accessed entries longer, but some idle time (e.g.
lunch break) can flush entries that you want to access after the lunch
break.
The idea of expiration applies to the case where we want possibly
stale entries to vanish and load newer data upon the next access. For
example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP. Is the
catcache based on the same idea as them? No.
What we want to do is to evict never or infrequently used cache
entries. That's naturally the task of LRU, isn't it? Even the high
performance Memcached and Redis use LRU when the cache is full. As
Bruce said, we don't have to be worried about the lock contention or
something, because we're talking about the backend local cache. Are
we worried about the overhead of manipulating the LRU chain? The
current catcache already does it on every access; it calls
dlist_move_head() to put the accessed entry to the front of the hash
bucket.
So if we want to address this case too (and we probably want), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.
The straightforward, natural, and familiar way is to limit the cache
size, which I mentioned in some previous mail. We should give the DBA
the ability to control memory usage, rather than considering what to
do after letting the memory area grow unnecessarily large. That's
what a typical "cache" is, isn't it?
https://en.wikipedia.org/wiki/Cache_(computing)
"To be cost-effective and to enable efficient use of data, caches must
be relatively small."
Another relevant suboptimal idea would be to provide each catcache
with a separate memory context, which is the child of
CacheMemoryContext. This gives slight optimization by using the slab
context (slab.c) for a catcache with fixed-size tuples. But that'd
be a bit complex for PG 12, I'm afraid.
Regards
MauMau
From: Alvaro Herrera
I think seqscanning the hash table is going to be too slow; Ideriha-san's
idea of having a dlist with the entries in LRU order (where each entry
is moved to head of list when it is touched) seemed good: it allows you
to evict older ones when the time comes, without having to scan the rest
of the entries. Having a dlist means two more pointers on each cache
entry AFAIR, so it's not a huge amount of memory.
Absolutely. We should try to avoid unpredictable long response times
caused by an occasional unlucky batch of processing. That makes
troubleshooting hard when the user asks why they experience unsteady
response times.
Regards
MauMau
On 2/8/19 2:27 PM, MauMau wrote:
From: Tomas Vondra
I don't think we need to remove the expired entries right away, if
there are only very few of them. The cleanup requires walking the
hash table, which means significant fixed cost. So if there are
only few expired entries (say, less than 25% of the cache), we can
just leave them around and clean them if we happen to stumble on
them (although that may not be possible with dynahash, which has no
concept of expiration) or before enlarging the hash table.

I agree that we don't need to evict cache entries as long as the
memory permits (within the control of the DBA).

But how does the concept of expiration fit the catcache? How would
the user determine the expiration time, i.e. setting of
syscache_prune_min_age? If you set a small value to evict
unnecessary entries faster, necessary entries will also be evicted.
Some access counter would keep accessed entries longer, but some idle
time (e.g. lunch break) can flush entries that you want to access
after the lunch break.
I'm not sure what you mean by "necessary" and "unnecessary" here. What
matters is how often an entry is accessed - if it's accessed often, it
makes sense to keep it in the cache. Otherwise evict it. Entries not
accessed for 5 minutes are clearly not accessed very often, so
getting rid of them will not hurt the cache hit ratio very much.
So I agree with Robert a time-based approach should work well here. It
does not have the issues with setting exact syscache size limit, it's
kinda self-adaptive etc.
In a way, this is exactly what the 5 minute rule [1] says about caching.
[1]: http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf
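For reference, the break-even formula from the cited Gray/Putzolu paper is, roughly:

\[
\text{BreakEvenInterval (seconds)} \;=\;
\frac{\text{PagesPerMBofRAM}}{\text{AccessesPerSecondPerDisk}}
\times
\frac{\text{PricePerDiskDrive}}{\text{PricePerMBofRAM}}
\]

With the 1987-era prices and access rates used in the paper, this came out at roughly five minutes, hence the name.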
The idea of expiration applies to the case where we want possibly
stale entries to vanish and load newer data upon the next access.
For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
Is the catcache based on the same idea as them? No.
I'm not sure what this has to do with those other databases.
What we want to do is to evict never or infrequently used cache
entries. That's naturally the task of LRU, isn't it? Even the high
performance Memcached and Redis use LRU when the cache is full. As
Bruce said, we don't have to be worried about the lock contention or
something, because we're talking about the backend local cache. Are
we worried about the overhead of manipulating the LRU chain? The
current catcache already does it on every access; it calls
dlist_move_head() to put the accessed entry to the front of the hash
bucket.
I'm certainly worried about the performance aspect of it. The syscache
is in plenty of hot paths, so adding overhead may have significant
impact. But that depends on how complex the eviction criteria will be.
And then there may be cases conflicting with the criteria, i.e. running
into just-evicted entries much more often. This is the issue with the
initially proposed hard limits on cache sizes, where it'd be trivial to
under-size it just a little bit.
So if we want to address this case too (and we probably want), we
may need to discard the old cache memory context somehow (e.g.
rebuild the cache in a new one, and copy the non-expired entries).
Which is a nice opportunity to do the "full" cleanup, of course.

The straightforward, natural, and familiar way is to limit the cache
size, which I mentioned in some previous mail. We should give the
DBA the ability to control memory usage, rather than considering what
to do after letting the memory area grow unnecessarily large.
That's what a typical "cache" is, isn't it?
Not sure which mail you're referring to - this seems to be the first
e-mail from you in this thread (per our archives).
I personally don't find explicit limit on cache size very attractive,
because it's rather low-level and difficult to tune, and very easy to
get it wrong (at which point you fall from a cliff). All the information
is in backend private memory, so how would you even identify that syscache is
the thing you need to tune, or how would you determine the correct size?
https://en.wikipedia.org/wiki/Cache_(computing)
"To be cost-effective and to enable efficient use of data, caches must
be relatively small."
Relatively small compared to what? It's also a question of how expensive
cache misses are.
Another relevant suboptimal idea would be to provide each catcache
with a separate memory context, which is the child of
CacheMemoryContext. This gives slight optimization by using the slab
context (slab.c) for a catcache with fixed-size tuples. But that'd
be a bit complex for PG 12, I'm afraid.
I don't know, but that does not seem very attractive. Each memory
context has some overhead, and it does not solve the issue of never
releasing memory to the OS. So we'd still have to rebuild the contexts
at some point, I'm afraid.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
I'm going to retake numbers with search-only queries.
Yeah, I was stupid.
I made a rerun of benchmark using "-S -T 30" on the server build
with no assertion and -O2. The numbers are the best of three
successive attempts. The patched version is running with
cache_target_memory = 0, cache_prune_min_age = 600 and
cache_entry_limit = 0 but pruning doesn't happen by the workload.

master: 13393 tps
v12 : 12625 tps (-6%)

Significant degradation is found.

Reducing the frequency of dlist_move_tail, by enforcing a 1ms interval
between two successive updates on the same entry, made the
degradation disappear.

patched : 13720 tps (+2%)

I think there's still no need for such frequency. It is 100ms in
the attached patch.

# I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..
Hi,
I've done a bunch of benchmarks on v13, and I don't see any serious
regression either. Each test creates a number of tables (100, 1k, 10k,
100k and 1M) and then runs SELECT queries on them. The tables are
accessed randomly - with either uniform or exponential distribution. For
each combination there are 5 runs, 60 seconds each (see the attached
shell scripts, it should be pretty obvious).
I've done the tests on two different machines - small one (i5 with 8GB
of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
almost exactly the same (with the exception of 1M tables, which do not
fit into RAM on the smaller one).
On the xeon, the results (throughput compared to master) look like this:
uniform 100 1000 10000 100000 1000000
------------------------------------------------------------
v13 105.04% 100.28% 102.96% 102.11% 101.54%
v13 (nodata) 97.05% 98.30% 97.42% 96.60% 107.55%
exponential 100 1000 10000 100000 1000000
------------------------------------------------------------
v13 100.04% 103.48% 101.70% 98.56% 103.20%
v13 (nodata) 97.12% 98.43% 98.86% 98.48% 104.94%
The "nodata" case means the tables were empty (so no files created),
while in the other case each table contained 1 row.
Per the results it's mostly break even, and in some cases there is
actually a measurable improvement.
That being said, the question is whether the patch actually reduces
memory usage in a useful way - that's not something this benchmark
validates. I plan to modify the tests to make pgbench script
time-dependent (i.e. to pick a subset of tables depending on time).
A couple of things I've happened to notice during a quick review:
1) The sgml docs in 0002 talk about "syscache_memory_target" and
"syscache_prune_min_age", but those options were renamed to just
"cache_memory_target" and "cache_prune_min_age".
2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
defined three times in guc.c for some reason.
3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
of just using two bool variables prune_by_age and prune_by_number doing
the same thing.
4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
pretty much mean long-running statements will set the lastaccess to a very
old timestamp? Also, it means that long-running statements (like a PL
function accessing a bunch of tables) won't do any eviction at all, no?
AFAICS we'll set the timestamp only once, at the very beginning.
I wonder whether using some other timestamp source (like a timestamp
updated regularly from a timer, or something like that) would work better.
5) There are two fread() calls in 0003 triggering a compiler warning
about unused return value.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
I'm not sure what you mean by "necessary" and "unnecessary" here. What
matters is how often an entry is accessed - if it's accessed often, it makes sense
to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are
clearly not accessed very often, so getting rid of them will not hurt the
cache hit ratio very much.
Right, "necessary" and "unnecessary" were imprecise, and it matters how frequent the entries are accessed. What made me say "unnecessary" is the pg_statistic entry left by CREATE/DROP TEMP TABLE which is never accessed again.
So I agree with Robert a time-based approach should work well here. It does
not have the issues with setting exact syscache size limit, it's kinda self-adaptive
etc.

In a way, this is exactly what the 5 minute rule [1] says about caching.
Then, can we just set syscache_prune_min_age to 5min? Otherwise, how can users set the expiration period?
The idea of expiration applies to the case where we want possibly
stale entries to vanish and load newer data upon the next access.
For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
Is the catcache based on the same idea as them? No.

I'm not sure what this has to do with those other databases.
I meant that the time-based eviction is not very good, because it could cause less frequently accessed entries to vanish even when memory is not short. Time-based eviction reminds me of Memcached, Redis, DNS, etc., which evict long-lived entries to avoid stale data, not to free space for other entries. I think size-based eviction is sufficient, like shared_buffers, OS page cache, CPU cache, disk cache, etc.
I'm certainly worried about the performance aspect of it. The syscache is in
plenty of hot paths, so adding overhead may have significant impact. But that
depends on how complex the eviction criteria will be.
The LRU chain manipulation, dlist_move_head() in SearchCatCacheInternal(), may certainly incur some overhead. If it has a visible impact, then we can do the manipulation only when the user sets an upper limit on the cache size.
And then there may be cases conflicting with the criteria, i.e. running into
just-evicted entries much more often. This is the issue with the initially
proposed hard limits on cache sizes, where it'd be trivial to under-size it just a
little bit.
In that case, the user can just enlarge the catcache.
Not sure which mail you're referring to - this seems to be the first e-mail from
you in this thread (per our archives).
Sorry, MauMau is me, Takayuki Tsunakawa.
I personally don't find explicit limit on cache size very attractive, because it's
rather low-level and difficult to tune, and very easy to get it wrong (at which
point you fall from a cliff). All the information is in backend private memory, so
how would you even identify syscache is the thing you need to tune, or how
would you determine the correct size?
Just like other caches, we can present a view that shows the hits, misses, and the hit ratio of the entire catcaches. If the hit ratio is low, the user can enlarge the catcache size. That's what Oracle and MySQL do as I referred to in this thread. The tuning parameter is the size. That's all. Besides, the v13 patch has as many as 4 parameters: cache_memory_target, cache_prune_min_age, cache_entry_limit, cache_entry_limit_prune_ratio. I don't think I can give the user good intuitive advice on how to tune these.
https://en.wikipedia.org/wiki/Cache_(computing)
"To be cost-effective and to enable efficient use of data, caches must
be relatively small."Relatively small compared to what? It's also a question of how expensive cache
misses are.
I guess the author meant that the cache is "relatively small" compared to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller than SSD/HDD. In our case, we have to pay more attention to limiting the catcache memory consumption, especially because the caches are duplicated in multiple backend processes.
I don't know, but that does not seem very attractive. Each memory context has
some overhead, and it does not solve the issue of never releasing memory to
the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.
I think there is little additional overhead on each catcache access -- the processing overhead is the same as when using aset, and the memory overhead is several dozen MemoryContext structures (one per catcache). The slab context (slab.c) returns empty blocks to the OS, unlike the allocation context (aset.c).
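For illustration, creating such a per-catcache slab context could look like this (a sketch; the function name and the fixed-size-entry assumption are illustrative):

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * One slab context per catcache, under CacheMemoryContext.  This only
     * pays off for fixed-size entries, since slab requires a single chunk
     * size.
     */
    static MemoryContext
    CreateCatCacheSlab(const char *name, Size entry_size)
    {
        return SlabContextCreate(CacheMemoryContext,
                                 name,
                                 SLAB_DEFAULT_BLOCK_SIZE,
                                 entry_size);
    }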
Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Reduced frequency of dlist_move_tail, by taking a 1ms interval between two
successive updates on the same entry, let the degradation disappear.

patched : 13720 tps (+2%)

What do you think contributed to this performance increase? Or do you think this is just measurement variation?
Most of my previous comments also seem to apply to v13, so let me repost them below:
(1)
+/* GUC variable to define the minimum age of entries that will be cosidered to
+ /* initilize catcache reference clock if haven't done yet */
cosidered -> considered
initilize -> initialize
I remember I saw some other wrong spelling and/or missing words, which I forgot (sorry).
(2)
Only the doc prefixes "sys" to the new parameter names. Other places don't have it. I think we should prefix sys, because relcache and plancache should be configurable separately because of their different usage patterns/lifecycle.
(3)
The doc doesn't describe the unit of syscache_memory_target. Kilobytes?
(4)
+ hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+ tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
+ tupsize = sizeof(CatCTup);
GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do.
(5)
+ if (entry_age > cache_prune_min_age)
">=" instead of ">"?
(6)
+ if (!ct->c_list || ct->c_list->refcount == 0)
+ {
+ CatCacheRemoveCTup(cp, ct);
It's better to write "ct->c_list == NULL" to follow the style in this file.
"ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been released for a long time, which might hardly happen.
(7)
CatalogCacheCreateEntry
+ int tupsize = 0;
if (ntp)
{
int i;
+ int tupsize;
tupsize is defined twice.
(8)
CatalogCacheCreateEntry
In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted. I'm afraid that's not negligible.
(9)
The memory for CatCList is not taken into account for syscache_memory_target.
Regards
Takayuki Tsunakawa
On 2/12/19 1:49 AM, Tsunakawa, Takayuki wrote:
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
I'm not sure what you mean by "necessary" and "unnecessary" here. What
matters is how often an entry is accessed - if it's accessed often, it makes sense
to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are
clearly not accessed very often, so getting rid of them will not hurt the
cache hit ratio very much.

Right, "necessary" and "unnecessary" were imprecise, and it matters
how frequent the entries are accessed. What made me say "unnecessary"
is the pg_statistic entry left by CREATE/DROP TEMP TABLE which is never
accessed again.
OK, understood.
So I agree with Robert a time-based approach should work well here. It does
not have the issues with setting exact syscache size limit, it's kinda self-adaptive
etc.

In a way, this is exactly what the 5 minute rule [1] says about caching.
Then, can we just set syscache_prune_min_age to 5min? Otherwise,
how can users set the expiration period?
I believe so.
The idea of expiration applies to the case where we want possibly
stale entries to vanish and load newer data upon the next access.
For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
Is the catcache based on the same idea as them? No.

I'm not sure what this has to do with those other databases.

I meant that the time-based eviction is not very good, because it
could cause less frequently accessed entries to vanish even when memory is not
short. Time-based eviction reminds me of Memcached, Redis, DNS, etc.
that evict long-lived entries to avoid stale data, not to free space
for other entries. I think size-based eviction is sufficient like
shared_buffers, OS page cache, CPU cache, disk cache, etc.
Right. But the logic behind time-based approach is that evicting such
entries should not cause any issues exactly because they are accessed
infrequently. It might incur some latency when we need them for the
first time after the eviction, but IMHO that's acceptable (although I
see Andres did not like that).
FWIW we might even evict entries after some time passes since inserting
them into the cache - that's what memcached et al do, IIRC. The logic is
that frequently accessed entries will get immediately loaded back (thus
keeping cache hit ratio high). But there are reasons why the other dbs
do that - like not having any cache invalidation (unlike us).
That being said, having a "minimal size" threshold before starting with
the time-based eviction may be a good idea.
I'm certainly worried about the performance aspect of it. The syscache is in
plenty of hot paths, so adding overhead may have significant impact. But that
depends on how complex the eviction criteria will be.

The LRU chain manipulation, dlist_move_head() in
SearchCatCacheInternal(), may certainly incur some overhead. If it has
a visible impact, then we can do the manipulation only when the user sets
an upper limit on the cache size.
I think the benchmarks done so far suggest the extra overhead is within
noise. So unless we manage to make it much more expensive, we should be
OK I think.
And then there may be cases conflicting with the criteria, i.e. running into
just-evicted entries much more often. This is the issue with the initially
proposed hard limits on cache sizes, where it'd be trivial to under-size it just a
little bit.

In that case, the user can just enlarge the catcache.
IMHO the main issues with this are
(a) It's not quite clear how to determine the appropriate limit. I can
probably apply a bit of perf+gdb, but I doubt that's very nice.
(b) It's not adaptive, so systems that grow over time (e.g. by adding
schemas and other objects) will keep hitting the limit over and over.
Not sure which mail you're referring to - this seems to be the first e-mail from
you in this thread (per our archives).

Sorry, MauMau is me, Takayuki Tsunakawa.
Ah, of course!
I personally don't find explicit limit on cache size very attractive, because it's
rather low-level and difficult to tune, and very easy to get it wrong (at which
point you fall from a cliff). All the information is in backend private memory, so
how would you even identify that syscache is the thing you need to tune, or how
would you determine the correct size?

Just like other caches, we can present a view that shows the hits, misses, and the hit ratio of the entire catcaches. If the hit ratio is low, the user can enlarge the catcache size. That's what Oracle and MySQL do, as I referred to in this thread. The tuning parameter is the size. That's all.
How will that work, considering the caches are in private backend
memory? And each backend may have quite different characteristics, even
if they are connected to the same database?
Besides, the v13 patch has as many as 4 parameters: cache_memory_target, cache_prune_min_age, cache_entry_limit, cache_entry_limit_prune_ratio. I don't think I can give the user good intuitive advice on how to tune these.
Isn't that more an argument for not having 4 parameters?
https://en.wikipedia.org/wiki/Cache_(computing)
"To be cost-effective and to enable efficient use of data, caches must
be relatively small."Relatively small compared to what? It's also a question of how expensive cache
misses are.I guess the author meant that the cache is "relatively small" compared to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller than SSD/HDD. In our case, we have to pay more attention to limit the catcache memory consumption, especially because they are duplicated in multiple backend processes.
I don't think so. IMHO the focus there is on "cost-effective", i.e.
caches are generally more expensive than the storage, so to make them
worth it you need to make them much smaller than the main storage.
That's pretty much what the 5 minute rule is about, I think.
But I don't see how this applies to the problem at hand, because the
system is already split into storage + cache (represented by RAM). The
challenge is how to use RAM to cache various pieces of data to get the
best behavior. The problem is, we don't have a unified cache, but
multiple smaller ones (shared buffers, page cache, syscache) competing
for the same resource.
Of course, having multiple (different) copies of syscache makes it even
more difficult.
(Does this make sense, or am I just babbling nonsense?)
I don't know, but that does not seem very attractive. Each memory context has
some overhead, and it does not solve the issue of never releasing memory to
the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.

I think there is little additional overhead on each catcache access
-- the processing overhead is the same as when using aset, and the memory
overhead is several dozen MemoryContext structures (one per
catcache).
Hmmm. That doesn't seem particularly terrible, I guess.
The slab context (slab.c) returns empty blocks to OS unlike the
allocation context (aset.c).
Slab can do that, but it requires certain allocation pattern, and I very
much doubt syscache has it. It'll be trivial to end up with one active
entry on each block (which means slab can't release it).
BTW doesn't syscache store the full on-disk tuple? That doesn't seem
like a fixed-length entry, which is a requirement for slab. No?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
I meant that the time-based eviction is not very good, because it
could cause less frequently accessed entries to vanish even when memory is not
short. Time-based eviction reminds me of Memcached, Redis, DNS, etc.
that evicts long-lived entries to avoid stale data, not to free space
for other entries. I think size-based eviction is sufficient like
shared_buffers, OS page cache, CPU cache, disk cache, etc.

Right. But the logic behind the time-based approach is that evicting such
entries should not cause any issues exactly because they are accessed
infrequently. It might incur some latency when we need them for the
first time after the eviction, but IMHO that's acceptable (although I
see Andres did not like that).
Yes, that's what I expressed. That is, I'm probably with Andres.
FWIW we might even evict entries after some time passes since inserting
them into the cache - that's what memcached et al do, IIRC. The logic is
that frequently accessed entries will get immediately loaded back (thus
keeping cache hit ratio high). But there are reasons why the other dbs
do that - like not having any cache invalidation (unlike us).
These are what Memcached and Redis do:
1. Evict entries that have lived longer than their TTLs.
This is independent of the cache size. This is to avoid keeping stale data in the cache when the underlying data (such as in the database) is modified. This doesn't apply to PostgreSQL.
2. Evict the least recently accessed entries.
This is to make room for new entries when the cache is full. This is similar to or the same as what PostgreSQL and other DBMSs do for their database cache. Oracle and MySQL also do this for their dictionary caches, where "dictionary cache" corresponds to syscache in PostgreSQL.
Here's my sketch for this feature. Although it may not meet all (contradictory) requirements as you said, it's simple and familiar for those who have used PostgreSQL and other DBMSs. What do you think? The points are simplicity, familiarity, and memory consumption control for the DBA.
* Add a GUC parameter syscache_size which imposes the upper limit on the total size of all catcaches, not on individual catcache.
The naming follows effective_cache_size. It can be syscache_mem to follow work_mem and maintenance_work_mem.
The default value is 0, which doesn't limit the cache size as now.
* A new member variable in CatCacheHeader tracks the total size of all cached entries.
* A single new LRU list in CatCacheHeader links all cache tuples in LRU order. Each cache access, in SearchCatCacheInternal(), puts the found entry at its front.
* Insertion of a new catcache entry adds the entry size to the total cache size. If the total size exceeds the limit defined by syscache_size, the least recently accessed entries are removed until the total cache size gets below the limit.
This eviction results in slight overhead when the cache is full, but the response time is steady. On the other hand, with the proposed approach, users will wonder about mysteriously long response times due to bulk entry deletions.
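A minimal sketch of the eviction step in this proposal (all names are illustrative, not an actual patch):

    #include "postgres.h"
    #include "lib/ilist.h"

    typedef struct LruEntry
    {
        dlist_node  lru_node;
        Size        size;               /* space accounted to this entry */
    } LruEntry;

    static dlist_head lru_list = DLIST_STATIC_INIT(lru_list);
    static Size total_cache_size = 0;   /* would live in CatCacheHeader */
    static Size syscache_size = 0;      /* the proposed GUC, in bytes; 0 = unlimited */

    static void
    cache_note_insertion(LruEntry *e)
    {
        total_cache_size += e->size;
        dlist_push_head(&lru_list, &e->lru_node);   /* most recent at head */

        /* evict from the tail until we are back under the limit */
        while (syscache_size > 0 &&
               total_cache_size > syscache_size &&
               !dlist_is_empty(&lru_list))
        {
            LruEntry *victim = dlist_tail_element(LruEntry, lru_node,
                                                  &lru_list);

            if (victim == e)
                break;                  /* never evict the new entry itself */

            dlist_delete(&victim->lru_node);
            total_cache_size -= victim->size;
            /* ... the real code would free the catcache tuple here ... */
        }
    }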
In that case, the user can just enlarge the catcache.
IMHO the main issues with this are
(a) It's not quite clear how to determine the appropriate limit. I can
probably apply a bit of perf+gdb, but I doubt that's very nice.
Like Oracle and MySQL, the user should be able to see the cache hit ratio with a statistics view.
(b) It's not adaptive, so systems that grow over time (e.g. by adding
schemas and other objects) will keep hitting the limit over and over.
The user needs to restart the database instance to enlarge the syscache. That's also true for shared buffers: to accommodate a growing amount of data, the user needs to increase shared_buffers and restart the server.
But the current syscache is local memory, so the server may not need a restart.
Just like other caches, we can present a view that shows the hits, misses,
and the hit ratio of the entire catcaches. If the hit ratio is low, the
user can enlarge the catcache size. That's what Oracle and MySQL do as
I referred to in this thread. The tuning parameter is the size. That's
all.

How will that work, considering the caches are in private backend
memory? And each backend may have quite different characteristics, even
if they are connected to the same database?
Assuming that pg_stat_syscache (pid, cache_name, hits, misses) gives the statistics, the statistics data can be stored in shared memory, because the number of backends and the number of catcaches are fixed.
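For illustration, the fixed-size shared area could be laid out like this (a sketch; the struct and function names are made up):

    #include "postgres.h"
    #include "storage/shmem.h"

    /* one slot per (backend, catcache) pair */
    typedef struct SyscacheStatsEntry
    {
        uint64      hits;
        uint64      misses;
    } SyscacheStatsEntry;

    /* shared array, indexed as stats[backend_id * num_caches + cache_id] */
    static SyscacheStatsEntry *syscache_stats;

    static Size
    SyscacheStatsShmemSize(int max_backends, int num_caches)
    {
        return mul_size(mul_size(max_backends, num_caches),
                        sizeof(SyscacheStatsEntry));
    }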
I guess the author meant that the cache is "relatively small" compared
to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller
than SSD/HDD. In our case, we have to pay more attention to limit the
catcache memory consumption, especially because they are duplicated in
multiple backend processes.

I don't think so. IMHO the focus there is on "cost-effective", i.e.
caches are generally more expensive than the storage, so to make them
worth it you need to make them much smaller than the main storage.
I think we're saying the same thing. Perhaps my English is not good enough.
But I don't see how this applies to the problem at hand, because the
system is already split into storage + cache (represented by RAM). The
challenge is how to use RAM to cache various pieces of data to get the
best behavior. The problem is, we don't have a unified cache, but
multiple smaller ones (shared buffers, page cache, syscache) competing
for the same resource.
You're right. On the other hand, we can consider syscache, shared buffers, and page cache as different tiers of storage, even though they are all on DRAM. syscache caches some data from shared buffers for efficient access. If we use much memory for syscache, there's less memory for caching user data in shared buffers and page cache. That's a normal tradeoff of caches.
Slab can do that, but it requires certain allocation pattern, and I very
much doubt syscache has it. It'll be trivial to end up with one active
entry on each block (which means slab can't release it).
I expect so, too, although slab context makes efforts to mitigate that possibility like this:
* This also allows various optimizations - for example when searching for
* free chunk, the allocator reuses space from the fullest blocks first, in
* the hope that some of the less full blocks will get completely empty (and
* returned back to the OS).
BTW doesn't syscache store the full on-disk tuple? That doesn't seem
like a fixed-length entry, which is a requirement for slab. No?
Some system catalogs have fixed-size tuples, like pg_am and pg_amop. But I guess the number of such catalogs is small. Dominant catalogs like pg_class and pg_attribute have variable-size tuples. So using different memory contexts for a limited set of catalogs might not show any visible performance improvement or memory reduction.
Regards
Takayuki Tsunakawa
At Fri, 8 Feb 2019 09:42:20 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F41EDD1@G01JPEXMBKW04>
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
I made a rerun of benchmark using "-S -T 30" on the server build with no assertion and
-O2. The numbers are the best of three successive attempts. The patched version is
running with cache_target_memory = 0, cache_prune_min_age = 600 and
cache_entry_limit = 0 but pruning doesn't happen by the workload.

master: 13393 tps
v12 : 12625 tps (-6%)

Significant degradation is found.

Reduced frequency of dlist_move_tail, by taking a 1ms interval between two successive
updates on the same entry, let the degradation disappear.

patched : 13720 tps (+2%)
It would be good to introduce some interval.
I followed your benchmark (initialized with scale factor = 10; other options the same)
and found the same tendency.
These are averages of 5 trials.
master: 7640.000538
patch_v12: 7417.981378 (3% down from master)
patch_v13: 7645.071787 (almost the same as master)
Thank you for cross checking.
These cases are workloads where pruning does not happen, as you mentioned.
I'd like to benchmark the cache-pruning case as well.
To demonstrate the cache-pruning case,
right now I'm creating hundreds of partitioned tables and running a SELECT query against each of them
using a pgbench custom file. Maybe using a small cache_prune_min_age or a hard limit would be better.
Is there any good model?
As per Tomas' benchmark, it doesn't seem to hurt in that case.
# I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..
How about MIN_LRU_UPDATE_INTERVAL?
Looks fine. Fixed in the next version. Thank you for the suggestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for the testing and the comments, Tomas.
At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com>
On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
I've done a bunch of benchmarks on v13, and I don't see any serious
regression either. Each test creates a number of tables (100, 1k, 10k,
100k and 1M) and then runs SELECT queries on them. The tables are
accessed randomly - with either uniform or exponential distribution. For
each combination there are 5 runs, 60 seconds each (see the attached
shell scripts, it should be pretty obvious).

I've done the tests on two different machines - small one (i5 with 8GB
of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
almost exactly the same (with the exception of 1M tables, which do not
fit into RAM on the smaller one).

On the xeon, the results (throughput compared to master) look like this:

uniform 100 1000 10000 100000 1000000
------------------------------------------------------------
v13 105.04% 100.28% 102.96% 102.11% 101.54%
v13 (nodata) 97.05% 98.30% 97.42% 96.60% 107.55%

exponential 100 1000 10000 100000 1000000
------------------------------------------------------------
v13 100.04% 103.48% 101.70% 98.56% 103.20%
v13 (nodata) 97.12% 98.43% 98.86% 98.48% 104.94%

The "nodata" case means the tables were empty (so no files created),
while in the other case each table contained 1 row.

Per the results it's mostly break even, and in some cases there is
actually a measurable improvement.
Great! I guess it comes from the reduced hash size?
That being said, the question is whether the patch actually reduces
memory usage in a useful way - that's not something this benchmark
validates. I plan to modify the tests to make pgbench script
time-dependent (i.e. to pick a subset of tables depending on time).
Thank you.
A couple of things I've happened to notice during a quick review:
1) The sgml docs in 0002 talk about "syscache_memory_target" and
"syscache_prune_min_age", but those options were renamed to just
"cache_memory_target" and "cache_prune_min_age".
I'm at a loss as to what to call the syscache for users. I think it is "catalog
cache". The most basic component is called catcache, which is
covered by the syscache layer; neither of them is revealed to
users, and it is shown to users as "catalog cache".

Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
it exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?
2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
defined three times in guc.c for some reason.
It is just a PoC, added to show how it looks. (The multiple
instances must be a result of a convulsion of my fingers..) I
think this is not useful unless it can be specified on a per-relation
or per-cache basis. I'll remove the GUC and add reloptions for
the purpose. (But it won't work for pg_class and pg_attribute
for now.)
3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
of just using two bool variables prune_by_age and prune_by_number doing
the same thing.
Agreed. It was a kind of memory stinginess, which is useless there.
4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
pretty much mean long-running statements will set the lastaccess to a very
old timestamp? Also, it means that long-running statements (like a PL
function accessing a bunch of tables) won't do any eviction at all, no?
AFAICS we'll set the timestamp only once, at the very beginning.

I wonder whether using some other timestamp source (like a timestamp
updated regularly from a timer, or something like that) would work better.
I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
than a fixed value of 30s, with 1s as the minimum.
I observed significant degradation by setting up a timer at every
statement start. The patch does the following to get rid of
the degradation.

(1) Every statement updates the catcache timestamp as it currently
does. (SetCatCacheClock)
(2) The timestamp is also updated periodically using a timer,
separately from (1). The timer starts if not yet running at the time
of (1). (SetCatCacheClock, UpdateCatCacheClock)
(3) Statement end and transaction end don't stop the timer, to
avoid the overhead of setting up a timer.
(4) But it stops on error. I chose not to change the behavior in
PostgresMain of killing all timers on error.
(5) Also, changing the GUC catalog_cache_prune_min_age kills the
timer, in order to reflect the change quickly, especially when
it is shortened.
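A rough sketch of steps (1) and (2), using the timeout API (names are illustrative; it assumes RegisterTimeout(USER_TIMEOUT, ...) was done at backend start, and glosses over the signal-safety constraints of real timeout handlers):

    #include "postgres.h"
    #include "utils/timeout.h"
    #include "utils/timestamp.h"

    static volatile TimestampTz catcacheclock;
    static int  clock_update_interval_ms = 30 * 1000;  /* ~min_age / 10 */

    /* timer callback: refresh the clock for long-running statements */
    static void
    CatCacheClockTimeout(void)
    {
        catcacheclock = GetCurrentTimestamp();
    }

    /* (1): called at every statement start */
    static void
    SetCatCacheClock(TimestampTz stmt_start)
    {
        catcacheclock = stmt_start;

        /* (2): arm the periodic timer if it is not already running */
        if (!get_timeout_active(USER_TIMEOUT))
            enable_timeout_after(USER_TIMEOUT, clock_update_interval_ms);
    }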
5) There are two fread() calls in 0003 triggering a compiler warning
about unused return value.
Ugh. It's in PoC style... (but my compiler didn't complain about
it). Maybe fixed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 12 Feb 2019 01:02:39 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB972A6@G01JPEXMBYT05>
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Reduced frequency of dlist_move_tail, by taking a 1ms interval between two
successive updates on the same entry, let the degradation disappear.

patched : 13720 tps (+2%)

What do you think contributed to this performance increase? Or do you think this is just measurement variation?
Most of my previous comments also seem to apply to v13, so let me repost them below:
(1)
+/* GUC variable to define the minimum age of entries that will be cosidered to
+ /* initilize catcache reference clock if haven't done yet */
cosidered -> considered
initilize -> initialize
Fixed. I found "databsae", "temprary", "resturns",
"If'force'"(missing space), "aginst", "maintan". And all fixed.
I remember I saw some other wrong spelling and/or missing words, which I forgot (sorry).
Thank you for pointing out some of them.
(2)
Only the doc prefixes "sys" to the new parameter names. Other places don't have it. I think we should prefix sys, because relcache and plancache should be configurable separately because of their different usage patterns/lifecycle.
I tend to agree. They are already removed in this patchset. The
names are changed to "catalog_cache_*" in the new version.
(3)
The doc doesn't describe the unit of syscache_memory_target. Kilobytes?
Added.
(4)
+ hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+ tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
+ tupsize = sizeof(CatCTup);
GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do.
Thanks. Done. It now includes the bucket and cache header parts, but still
excludes clists. Renamed from tupsize to memusage.
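For illustration, the accounting pattern that review comment asks for looks roughly like this (the wrapper and counter names are made up):

    #include "postgres.h"
    #include "utils/memutils.h"

    /* hypothetical running total for one catcache */
    static Size cc_memusage = 0;

    static void *
    cc_alloc_tracked(Size size)
    {
        void *ptr = MemoryContextAlloc(CacheMemoryContext, size);

        /* counts chunk header and rounding, not just the request */
        cc_memusage += GetMemoryChunkSpace(ptr);
        return ptr;
    }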
(5)
+ if (entry_age > cache_prune_min_age)
">=" instead of ">"?

I didn't take it seriously, but it is better. Fixed.
(6)
+ if (!ct->c_list || ct->c_list->refcount == 0)
+ {
+ CatCacheRemoveCTup(cp, ct);
It's better to write "ct->c_list == NULL" to follow the style in this file.
"ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been released for a long time, which should rarely happen.
Yeah, I fixed it in v12. This no longer removes an entry in
use. ("if (c_list)" is the style used in this file.)
(7)
CatalogCacheCreateEntry
+ int tupsize = 0;
  if (ntp)
  {
      int i;
+     int tupsize;
tupsize is defined twice.
The second tupsize was bogus, but the first is removed in this
version. Now an entry's memory usage is calculated from its chunk
size.
(8)
CatalogCacheCreateEntry
In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted. I'm afraid that's not negligible.
Right. Fixed.
(9)
The memory for CatCList is not taken into account for syscache_memory_target.
Yeah, this is intentional, since a CatCList is short-lived. Comment added.
| * Don't waste time by counting the list in catcache memory usage,
| * since a list doesn't persist for a long time
| */
| cl = (CatCList *)
|     palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));
Please find the attached, which is the new version v14 addressing
Tomas', Ideriha-san's, and your comments.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2/12/19 12:35 PM, Kyotaro HORIGUCHI wrote:
Thank you for testing and the commits, Tomas.
At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com>
On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
I've done a bunch of benchmarks on v13, and I don't see any serious
regression either. Each test creates a number of tables (100, 1k, 10k,
100k and 1M) and then runs SELECT queries on them. The tables are
accessed randomly - with either uniform or exponential distribution. For
each combination there are 5 runs, 60 seconds each (see the attached
shell scripts, it should be pretty obvious).
I've done the tests on two different machines - small one (i5 with 8GB
of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
almost exactly the same (with the exception of 1M tables, which does not
fit into RAM on the smaller one).
On the xeon, the results (throughput compared to master) look like this:
uniform        100      1000     10000    100000   1000000
------------------------------------------------------------
v13            105.04%  100.28%  102.96%  102.11%  101.54%
v13 (nodata)   97.05%   98.30%   97.42%   96.60%   107.55%
exponential    100      1000     10000    100000   1000000
------------------------------------------------------------
v13            100.04%  103.48%  101.70%  98.56%   103.20%
v13 (nodata)   97.12%   98.43%   98.86%   98.48%   104.94%
The "nodata" case means the tables were empty (so no files created),
while in the other case each table contained 1 row.
Per the results it's mostly break even, and in some cases there is
actually a measurable improvement.
Great! I guess it comes from reduced size of hash?
Not sure about that. I haven't actually verified that it reduces the
cache size at all - I was measuring the overhead of the extra work. And
I don't think the syscache actually shrunk significantly, because the
throughput was quite high (~15-30k tps, IIRC) so pretty much everything
was touched within the default 600 seconds.
That being said, the question is whether the patch actually reduces
memory usage in a useful way - that's not something this benchmark
validates. I plan to modify the tests to make pgbench script
time-dependent (i.e. to pick a subset of tables depending on time).
Thank you.
A couple of things I've happened to notice during a quick review:
1) The sgml docs in 0002 talk about "syscache_memory_target" and
"syscache_prune_min_age", but those options were renamed to just
"cache_memory_target" and "cache_prune_min_age".
I'm at a loss how to name syscache for users. I think it is "catalog
cache". The most basic component is called catcache, which is
covered by the syscache layer; both of them are not revealed to
users, and it is shown to users as "catalog cache".
Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
it exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?
I think "catalog_cache" sounds about right, although my point was simply
that there's a discrepancy between sgml docs and code.
2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
defined three times in guc.c for some reason.It is just PoC, added to show how it looks. (The multiple
instances must bex a result of a convulsion of my fingers..) I
think this is not useful unless it can be specfied per-relation
or per-cache basis. I'll remove the GUC and add reloptions for
the purpose. (But it won't work for pg_class and pg_attribute
for now).
OK, although I'd just keep it as simple as possible. TBH I can't really
imagine users tuning limits for individual caches in any meaningful way.
3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
of just using two bool variables prune_by_age and prune_by_number doing
the same thing.
Agreed. That's a kind of memory-stinginess which is useless there.
4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
pretty much mean long-running statements will set the lastaccess to a very
old timestamp? Also, it means that long-running statements (like a PL
function accessing a bunch of tables) won't do any eviction at all, no?
AFAICS we'll set the timestamp only once, at the very beginning.
I wonder whether we should use some other timestamp source (like a timestamp
updated regularly from a timer, or something like that).
I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_cache_prune_min_age/10
rather than a fixed value of 30s, with 1s as the minimum.
Actually, I see CatCacheCleanupOldEntries contains this comment:
/*
* Calculate the duration from the time of the last access to the
* "current" time. Since catcacheclock is not advanced within a
* transaction, the entries that are accessed within the current
* transaction won't be pruned.
*/
which I think is pretty much what I've been saying ... But the question
is whether we need to do something about it.
I observed significant degradation by setting up the timer at every
statement start. The patch does the following to get rid of
the degradation.
(1) Every statement updates the catcache timestamp, as it
    currently does. (SetCatCacheClock)
(2) The timestamp is also updated periodically using a timer,
    separately from (1). The timer starts at the time of (1) if
    it is not running yet. (SetCatCacheClock, UpdateCatCacheClock)
(3) Statement end and transaction end don't stop the timer, to
    avoid the overhead of setting up a timer again.
(4) But it stops on error. I chose not to change the behavior in
    PostgresMain that kills all timers on error.
(5) Also, changing the GUC catalog_cache_prune_min_age kills the
    timer, in order to reflect the change quickly, especially when
    it is shortened.
Interesting. What was the frequency of the timer / how often was it
executed? Can you share the code somehow?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
I'm at a loss how to name syscache for users. I think it is "catalog
cache". The most basic component is called catcache, which is
covered by the syscache layer; both of them are not revealed to
users, and it is shown to users as "catalog cache".
Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
it exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?
PostgreSQL documentation uses "system catalog" in its table of contents, so syscat_cache_xxx would be a bit more familiar? I'm for either catalog_ or syscat_, but what name shall we use for the relation cache? catcache and relcache have different element sizes and possibly different usage patterns, so they may as well have different parameters, just like MySQL does. If we follow that idea, then the name would be relation_cache_xxx. However, from the user's viewpoint, the relation cache is also created from the system catalog, like pg_class and pg_attribute...
Regards
Takayuki Tsunakawa
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_cache_prune_min_age/10
rather than a fixed value of 30s, with 1s as the minimum.
Actually, I see CatCacheCleanupOldEntries contains this comment:
/*
 * Calculate the duration from the time of the last access to the
 * "current" time. Since catcacheclock is not advanced within a
 * transaction, the entries that are accessed within the current
 * transaction won't be pruned.
 */
which I think is pretty much what I've been saying ... But the question
is whether we need to do something about it.
Hmm, I'm surprised by the v14 patch in this regard. I remember that previous patches renewed the cache clock on every statement, and that is correct. If the cache clock is only updated at the beginning of a transaction, the following TODO item would not be solved:
https://wiki.postgresql.org/wiki/Todo
" Reduce memory use when analyzing many tables in a single command by making catcache and syscache flushable or bounded."
Also, Tom mentioned pg_dump in this thread (protect syscache...). pg_dump runs in a single transaction, touching all system catalogs. That may result in OOM, and this patch can rescue it.
Regards
Takayuki Tsunakawa
At Tue, 12 Feb 2019 20:36:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190212.203628.118792892.horiguchi.kyotaro@lab.ntt.co.jp>
(4)
+ hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+ tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
+ tupsize = sizeof(CatCTup);
GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do.
Thanks. Done. It now includes the bucket and cache header parts
but still excludes the clist. Renamed from tupsize to memusage.
It is too complex, as I was afraid. The indirect calls cause
significant degradation. (Anyway, the previous code was bogus in
that it passes a CACHELINEALIGN'ed pointer to get_chunk_size..)
Instead, I added an accounting(?) interface function.
| MemoryContextGetConsumption(MemoryContext cxt);
The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.
(1) New patch v15-0002 adds accounting feature to MemoryContext.
(It adds this feature only to AllocSet, if this is acceptable
it can be extended to other allocators.)
(2) Another new patch v15-0005 on top of previous design of
limit-by-number-of-a-cache feature converts it to
limit-by-size-on-all-caches feature, which I think is
Tsunakawa-san wanted.
As far as I can see, no significant degradation is found in the
usual code paths (as long as pruning doesn't happen).
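As a rough illustration of how the catcache could consume the
accounting API from (1): MemoryContextGetConsumption and base_size
appear later in this thread, while the cc_memusage field and the
function shape here are my assumptions.

static void
build_entry_accounted(CatCache *cache)
{
    uint64  base_size = MemoryContextGetConsumption(CacheMemoryContext);

    /* ... build the CatCTup and copy its keys in CacheMemoryContext ... */

    /* charge whatever the context grew by to this cache */
    cache->cc_memusage +=
        MemoryContextGetConsumption(CacheMemoryContext) - base_size;
}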
About the new global-size based eviction (2), cache entry
creation becomes slow after the total size reaches the limit,
since every new entry evicts one or more old (=
not-recently-used) entries. Because knobs are no longer needed for
each cache, it becomes far more realistic. So I added documentation
of "catalog_cache_max_size" in 0005.
About the age-based eviction, the bulk eviction seems to take a
bit of a long time, but it happens instead of hash resizing, so the
user doesn't observe additional slowdown. On the contrary, the
pruning can avoid a rehash that scans the whole cache. I think that
is the gain seen in Tomas' experiment.
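A sketch of how the global-size eviction in (2) might look:
cc_lru_list, global_size, and catalog_cache_max_size appear later in
this thread; the lru_node and size fields on CatCTup, and the helper
itself, are my assumptions.

static void
enforce_catalog_cache_max_size(void)
{
    while (catalog_cache_max_size > 0 &&
           global_size > (Size) catalog_cache_max_size * 1024 &&
           !dlist_is_empty(&cc_lru_list))
    {
        /* the least recently used entry sits at the head of the list */
        CatCTup *ct = dlist_container(CatCTup, lru_node,
                                      dlist_head_node(&cc_lru_list));

        /* never evict an entry that is still referenced */
        if (ct->refcount != 0 ||
            (ct->c_list && ct->c_list->refcount != 0))
            break;

        global_size -= ct->size;
        CatCacheRemoveCTup(ct->my_cache, ct);
    }
}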
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Wed, 13 Feb 2019 02:15:42 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB97CF1@G01JPEXMBYT05>
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_cache_prune_min_age/10
rather than a fixed value of 30s, with 1s as the minimum.
Actually, I see CatCacheCleanupOldEntries contains this comment:
/*
 * Calculate the duration from the time of the last access to the
 * "current" time. Since catcacheclock is not advanced within a
 * transaction, the entries that are accessed within the current
 * transaction won't be pruned.
 */
which I think is pretty much what I've been saying ... But the question
is whether we need to do something about it.
Hmm, I'm surprised by the v14 patch in this regard. I remember that previous patches renewed the cache clock on every statement, and that is correct. If the cache clock is only updated at the beginning of a transaction, the following TODO item would not be solved:
Sorry, it's just a stale comment. In v15, it is already.... ouch!
It is still left alone. (Actually CatCacheGetStats doesn't perform
pruning.) I'll remove it in the next version. It is called in
start_xact_command, which is called per statement, provided with
the statement timestamp.
/*
 * Calculate the duration from the time of the last access to
 * the "current" time. catcacheclock is updated on a per-statement
 * basis and additionally updated periodically during a long
 * running query.
 */
TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
" Reduce memory use when analyzing many tables in a single command by making catcache and syscache flushable or bounded."
In v14 and v15, in addition to that, a timer that fires with the
interval of catalog_cache_prune_min_age/10 (30s when the
parameter is 5min) updates the catcache clock using
gettimeofday(), which in turn is the source of the LRU timestamp.
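For concreteness, a sketch of the age check driven by that clock:
TimestampDifference and catcacheclock follow the thread, while the
entry fields and the helper itself are my assumptions.

static bool
entry_is_prunable(CatCTup *ct)
{
    long    entry_age;
    int     us;

    /* age = catcacheclock - last access; the timer keeps the clock fresh */
    TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);

    /* ">=" per the earlier review comment (5) */
    return entry_age >= catalog_cache_prune_min_age;
}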
Also, Tom mentioned pg_dump in this thread (protect syscache...). pg_dump runs in a single transaction, touching all system catalogs. That may result in OOM, and this patch can rescue it.
So, all of these problems should be addressed in v14.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 12 Feb 2019 18:33:46 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <d3b291ff-d993-78d1-8d28-61bcf72793d6@2ndquadrant.com>
"catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?I think "catalog_cache" sounds about right, although my point was simply
that there's a discrepancy between sgml docs and code.
system_catalog_cache is too long for parameter names. So I named the
parameters "catalog_cache_*" and used "system catalog cache" or
"catalog cache" in the documentation.
2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
defined three times in guc.c for some reason.It is just PoC, added to show how it looks. (The multiple
instances must bex a result of a convulsion of my fingers..) I
think this is not useful unless it can be specfied per-relation
or per-cache basis. I'll remove the GUC and add reloptions for
the purpose. (But it won't work for pg_class and pg_attribute
for now).OK, although I'd just keep it as simple as possible. TBH I can't really
imagine users tuning limits for individual caches in any meaningful way.
I also feel like so, but anyway (:p), in v15 it has evolved into
a feature that limits the cache size by the total size, based on a
global LRU list.
I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_cache_prune_min_age/10
rather than a fixed value of 30s, with 1s as the minimum.
Actually, I see CatCacheCleanupOldEntries contains this comment:
/*
 * Calculate the duration from the time of the last access to the
 * "current" time. Since catcacheclock is not advanced within a
 * transaction, the entries that are accessed within the current
 * transaction won't be pruned.
 */
which I think is pretty much what I've been saying ... But the question
is whether we need to do something about it.
As I wrote in the message just replied to Tsunakawa-san, it is just
a bogus comment. The correct one is the following. I'll replace
it in the next version.
 * Calculate the duration from the time of the last access to
 * the "current" time. catcacheclock is updated on a per-statement
 * basis and additionally updated periodically during a long
 * running query.
I observed significant degradation by setting up the timer at every
statement start. The patch does the following to get rid of
the degradation.
(1) Every statement updates the catcache timestamp, as it
    currently does. (SetCatCacheClock)
(2) The timestamp is also updated periodically using a timer,
    separately from (1). The timer starts at the time of (1) if
    it is not running yet. (SetCatCacheClock, UpdateCatCacheClock)
(3) Statement end and transaction end don't stop the timer, to
    avoid the overhead of setting up a timer again.
(4) But it stops on error. I chose not to change the behavior in
    PostgresMain that kills all timers on error.
(5) Also, changing the GUC catalog_cache_prune_min_age kills the
    timer, in order to reflect the change quickly, especially when
    it is shortened.
Interesting. What was the frequency of the timer / how often was it
executed? Can you share the code somehow?
Please find it in v14 [1] or v15 [2], which contain the same code
for the purpose.
[1]: /messages/by-id/20190212.203628.118792892.horiguchi.kyotaro@lab.ntt.co.jp
[2]: /messages/by-id/20190213.153114.239737674.horiguchi.kyotaro@lab.ntt.co.jp
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 12, 2019 at 02:53:40AM +0100, Tomas Vondra wrote:
Right. But the logic behind time-based approach is that evicting such
entries should not cause any issues exactly because they are accessed
infrequently. It might incur some latency when we need them for the
first time after the eviction, but IMHO that's acceptable (although I
see Andres did not like that).
FWIW we might even evict entries after some time passes since inserting
them into the cache - that's what memcached et al do, IIRC. The logic is
that frequently accessed entries will get immediately loaded back (thus
keeping cache hit ratio high). But there are reasons why the other dbs
do that - like not having any cache invalidation (unlike us).
Agreed. If this fixes 90% of the issues people will have, and it
applies to the 99.9% of users who will never tune this, it is a clear
win. If we want to add something that requires tuning later, we can
consider it once the non-tuning solution is done.
That being said, having a "minimal size" threshold before starting with
the time-based eviction may be a good idea.
Agreed. I see the minimal size as a way to keep the system tables in
cache, which we know we will need for the next query.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
From: Bruce Momjian [mailto:bruce@momjian.us]
That being said, having a "minimal size" threshold before starting with
the time-based eviction may be a good idea.
Agreed. I see the minimal size as a way to keep the system tables in
cache, which we know we will need for the next query.
Isn't it the maximum size, not the minimal size? A maximum size allows us to keep the desired amount of system tables in memory, as well as to control memory consumption to avoid out-of-memory errors (OS crash!). I'm wondering why people want to take a different approach to catcache, which is unlike other PostgreSQL memory, e.g. shared_buffers, temp_buffers, SLRU buffers, work_mem, and other DBMSs.
Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
It is too complex, as I was afraid. The indirect calls cause significant
degradation. (Anyway, the previous code was bogus in that it passes a
CACHELINEALIGN'ed pointer to get_chunk_size..)
Instead, I added an accounting(?) interface function.
| MemoryContextGetConsumption(MemoryContext cxt);
The API returns the current consumption in this memory context. This allows
"real" memory accounting almost without overhead.
That looks like a great idea! Actually, I was thinking of using MemoryContextStats() or its new lightweight variant to get the used amount, but I was afraid it would be too costly to call in catcache code. You are smarter, and I was just stupid.
(2) Another new patch v15-0005 on top of previous design of
limit-by-number-of-a-cache feature converts it to
limit-by-size-on-all-caches feature, which I think is
Tsunakawa-san wanted.
Thank you very, very much! I look forward to reviewing v15. I'll be away from the office tomorrow, so I'd like to review it on this weekend or the beginning of next week. I've confirmed and am sure that 0001 can be committed.
As far as I can see, no significant degradation is found in the usual
code paths (as long as pruning doesn't happen).
About the new global-size based eviction (2), cache entry creation becomes
slow after the total size reaches the limit, since every new entry
evicts one or more old (=
not-recently-used) entries. Because knobs are no longer needed for each cache,
it becomes far more realistic. So I added documentation of
"catalog_cache_max_size" in 0005.
Could you show us the comparison of before and after the pruning starts, if you already have it? If you lost the data, I'm OK to see the data after the code review.
Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
(2) Another new patch v15-0005 on top of previous design of
limit-by-number-of-a-cache feature converts it to
limit-by-size-on-all-caches feature, which I think is
Tsunakawa-san wanted.
Yeah, size looks better to me.
As far as I can see, no significant degradation is found in the usual
code paths (as long as pruning doesn't happen).
About the new global-size based eviction (2), cache entry creation becomes slow after
the total size reaches the limit, since every new entry evicts one or more old (=
not-recently-used) entries. Because knobs are no longer needed for each cache, it becomes
far more realistic. So I added documentation of "catalog_cache_max_size" in 0005.
Now I'm also trying to benchmark, which will be posted in another email.
Here are things I noticed:
[1]: compiler warning (a possible fix is sketched after this list):
catcache.c:109:1: warning: missing braces around initializer [-Wmissing-braces]
dlist_head cc_lru_list = {0};
^
catcache.c:109:1: warning: (near initialization for ‘cc_lru_list.head’) [-Wmissing-braces]
[2]: catalog_cache_max_size does not appear in postgresql.conf.sample
[3]: the global LRU list and global size can be included in CatCacheHeader, which seems to me a good place because this structure contains global cache information regardless of the kind of CatCache
[4]: when applying the patch with git am, there are several warnings about trailing whitespace in v15-0003
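Regarding [1], one possible fix (a sketch; DLIST_STATIC_INIT is the
initializer lib/ilist.h already provides for exactly this case):

#include "lib/ilist.h"

/* statically initialize the global LRU list without the {0} warning */
dlist_head cc_lru_list = DLIST_STATIC_INIT(cc_lru_list);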
Regards,
Takeshi Ideriha
Hi,
On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
Instead, I added an accounting(?) interface function.
| MemoryContextGetConsumption(MemoryContext cxt);
The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.
That's definitely *NOT* almost without overhead. This adds additional
instructions to one of postgres' hottest code paths.
I think you're not working incrementally enough here. I strongly suggest
solving the negative cache entry problem, and then incrementally go from
there after that's committed. The likelihood of this patch ever getting
merged otherwise seems extremely small.
Greetings,
Andres Freund
On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
Hi,
On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
Instead, I added an accounting(?) interface function.
| MemoryContextGetConsumption(MemoryContext cxt);
The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.
That's definitely *NOT* almost without overhead. This adds additional
instructions to one of postgres' hottest code paths.
I think you're not working incrementally enough here. I strongly suggest
solving the negative cache entry problem, and then incrementally go from
there after that's committed. The likelihood of this patch ever getting
merged otherwise seems extremely small.
Agreed --- the patch is going in the wrong direction.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Feb 14, 2019 at 01:31:49AM +0000, Tsunakawa, Takayuki wrote:
From: Bruce Momjian [mailto:bruce@momjian.us]
That being said, having a "minimal size" threshold before starting
with the time-based eviction may be a good idea.
Agreed. I see the minimal size as a way to keep the system tables
in cache, which we know we will need for the next query.
Isn't it the maximum size, not the minimal size? A maximum size allows
us to keep the desired amount of system tables in memory as well as to
control memory consumption to avoid out-of-memory errors (OS crash!).
I'm wondering why people want to take a different approach to
catcache, which is unlike other PostgreSQL memory, e.g. shared_buffers,
temp_buffers, SLRU buffers, work_mem, and other DBMSs.
Well, that is an _excellent_ question, and one I had to think about.
I think, in general, smaller is better, as long as making something
smaller doesn't remove data that is frequently accessed. Having a timer
to expire only old entries seems like it accomplished this goal.
Having a minimum size and not taking it to zero size makes sense if we
know we will need certain entries like pg_class in the next query.
However, if the session is idle for hours, we should just probably
remove everything, so maybe the minimum doesn't make sense --- just
remove everything.
As for why we don't do this with everything --- we can't do it with
shared_buffers since we can't change its size while the server is
running. For work_mem, we assume all the work_mem data is for the
current query, and therefore frequently accessed. Also, work_mem is not
memory we can just free if it is not used since it contains intermediate
results required by the current query. I think temp_buffers, since it
can be resized in the session, actually could use a similar minimizing
feature, though that would mean it behaves slightly differently from
shared_buffers, and it might not be worth it. Also, I assume the value
of temp_buffers was mostly for use by the current query --- yes, it can
be used for cross-query caching, but I am not sure if that is its
primary purpose. I thought its goal was to prevent shared_buffers from
being populated with temporary per-session buffers.
I don't think other DBMSs are a good model since they have a reputation
for requiring a lot of tuning --- tuning that we have often automated.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
About the new global-size based eviction (2), cache entry creation
becomes slow after the total size reaches the limit, since every
new entry evicts one or more old (=
not-recently-used) entries. Because knobs are no longer needed for each
cache, it becomes far more realistic. So I added documentation of
"catalog_cache_max_size" in 0005.
Now I'm also trying to benchmark, which will be posted in another email.
According to recent comments by Andres and Bruce,
maybe we should address negative cache bloat step by step,
for example by reviewing Tom's patch.
But at the same time, I did some benchmark with only hard limit option enabled
and time-related option disabled, because the figures of this case are not provided in this thread.
So let me share it.
I did two experiments. One is to show that negative cache bloat is suppressed.
This thread originated from the issue that the negative cache of pg_statistics
bloats as creating and dropping temp tables is repeatedly executed.
/messages/by-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp
Using the script attached to the first email in this thread, I repeated create and drop temp table 10000 times.
(experiment is repeated 5 times. catalog_cache_max_size = 500kB.
compared master branch and patch with hard memory limit)
Here are TPS and CacheMemoryContext 'used' memory (total - freespace) calculated by MemoryContextPrintStats()
at 100, 1000, and 10000 create-and-drop transactions. The result shows that cache bloating is suppressed
after exceeding the limit (at 10000), but tps declines regardless of the limit.
number of tx (create and drop) | 100 |1000 |10000
-----------------------------------------------------------
used CacheMemoryContext (master) |610296|2029256 |15909024
used CacheMemoryContext (patch) |755176|880552 |880592
-----------------------------------------------------------
TPS (master) |414 |407 |399
TPS (patch) |242 |225 |220
Another experiment uses Tomas's script posted a while ago.
The scenario is to do select 1 from multiple tables randomly (uniform distribution).
(experiment is repeated 5 times. catalog_cache_max_size = 10MB.
compared master branch and patch with only hard memory limit enabled)
Before doing the benchmark, I checked that pruning happened only at 10000 tables
using a debug option. The result shows degradation regardless of before or after pruning.
I personally still need a hard size limitation, but I'm surprised that the difference is so significant.
number of tables | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |10966 |10654 |9099
TPS (patch) |4491 |2099 |378
Regards,
Takeshi Ideriha
On 2/13/19 1:23 AM, Tsunakawa, Takayuki wrote:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
I'm at a loss how to name syscache for users. I think it is "catalog
cache". The most basic component is called catcache, which is
covered by the syscache layer; both of them are not revealed to
users, and it is shown to users as "catalog cache".
Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
it exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?
PostgreSQL documentation uses "system catalog" in its table of contents, so syscat_cache_xxx would be a bit more familiar? I'm for either catalog_ or syscat_, but what name shall we use for the relation cache? catcache and relcache have different element sizes and possibly different usage patterns, so they may as well have different parameters, just like MySQL does. If we follow that idea, then the name would be relation_cache_xxx. However, from the user's viewpoint, the relation cache is also created from the system catalog, like pg_class and pg_attribute...
I think "catalog_cache_..." is fine. If we end up with a similar
patchfor relcache, we can probably call it "relation_cache_".
I'd be OK even with "system_catalog_cache_..." - I don't think it's
overly long (better to have a longer but descriptive name), and "syscat"
just seems like unnecessary abbreviation.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/14/19 3:46 PM, Bruce Momjian wrote:
On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
Hi,
On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
Instead, I added an accounting(?) interface function.
| MemoryContextGetConsumption(MemoryContext cxt);
The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.
That's definitely *NOT* almost without overhead. This adds additional
instructions to one of postgres' hottest code paths.
I think you're not working incrementally enough here. I strongly suggest
solving the negative cache entry problem, and then incrementally go from
there after that's committed. The likelihood of this patch ever getting
merged otherwise seems extremely small.
Agreed --- the patch is going in the wrong direction.
I recall endless discussions about memory accounting in the
"memory-bounded hash-aggregate" patch a couple of years ago, and the
overhead was one of the main issues there. So yeah, trying to solve that
problem here is likely to kill this patch (or at least significantly
delay it).
ISTM there's a couple of ways to deal with that:
1) Ignore the memory amounts entirely, and do just time-base eviction.
2) If we want some size thresholds (e.g. to disable eviction for
backends with small caches etc.), use the number of entries instead. I
don't think that's particularly worse than specifying size in MB.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/14/19 4:49 PM, 'Bruce Momjian' wrote:
On Thu, Feb 14, 2019 at 01:31:49AM +0000, Tsunakawa, Takayuki wrote:
From: Bruce Momjian [mailto:bruce@momjian.us]
That being said, having a "minimal size" threshold before starting
with the time-based eviction may be a good idea.
Agreed. I see the minimal size as a way to keep the system tables
in cache, which we know we will need for the next query.
Isn't it the maximum size, not the minimal size? A maximum size allows
us to keep the desired amount of system tables in memory as well as to
control memory consumption to avoid out-of-memory errors (OS crash!).
I'm wondering why people want to take a different approach to
catcache, which is unlike other PostgreSQL memory, e.g. shared_buffers,
temp_buffers, SLRU buffers, work_mem, and other DBMSs.
Well, that is an _excellent_ question, and one I had to think about.
I think we're talking about two different concepts here:
1) minimal size - We don't do any extra eviction at all until we reach
this cache size, so we don't get any extra overhead from it if a system
does not have issues.
2) maximal size - We ensure the cache size is below this threshold. If
there's more data, we evict enough entries to get below it.
My proposal is essentially to do just (1), so the cache can grow very
large if needed but then it shrinks again after a while.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
I think "catalog_cache_..." is fine. If we end up with a similar
patchfor relcache, we can probably call it "relation_cache_".
Agreed, those are not too long or too short, and they are sufficiently descriptive.
Regards
Takayuki Tsunakawa
On 2019-Feb-15, Tomas Vondra wrote:
ISTM there's a couple of ways to deal with that:
1) Ignore the memory amounts entirely, and do just time-base eviction.
2) If we want some size thresholds (e.g. to disable eviction for
backends with small caches etc.) use the number of entries instead. I
don't think that's particularly worse that specifying size in MB.
Why is there a *need* for size-based eviction? Seems that time-based
should be sufficient. Is the proposed approach to avoid eviction at all
until the size threshold has been reached? I'm not sure I see the point
of that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Horiguchi-san,
I've looked through your patches. This is the first part of my review results. Let me post the rest after another work today.
BTW, how about merging 0003 and 0005, and separating and deferring 0004 in another thread? That may help to relieve other community members by making this patch set not so large and complex.
[Bottleneck investigation]
Ideriha-san and I are trying to find the bottleneck. My first try shows there's little overhead. Here's what I did:
<postgresql.conf>
shared_buffers = 1GB
catalog_cache_prune_min_age = -1
catalog_cache_max_size = 10MB
<benchmark>
$ pgbench -i -s 10
$ pg_ctl stop and then start
$ cache all data in shared buffers by running pg_prewarm on branches, tellers, accounts, and their indexes
$ pgbench --select-only -c 1 -T 60
<result>
master : 8612 tps
patched: 8553 tps (-0.7%)
There's little (0.7%) performance overhead with:
* one additional dlist_move_tail() in every catcache access
* memory usage accounting in operations other than catcache access (relevant catcache entries should be cached in the first pgbench transaction)
I'll check other patterns to find out how big overhead there is.
[Source code review]
Below are my findings on the patch set v15:
(1) patch 0001
All right.
(2) patch 0002
@@ -87,6 +87,7 @@ typedef struct MemoryContextData
const char *name; /* context name (just for debugging) */
const char *ident; /* context ID if any (just for debugging) */
MemoryContextCallback *reset_cbs; /* list of reset/delete callbacks */
+ uint64 consumption; /* accumulates consumed memory size */
} MemoryContextData;
Size is more appropriate as a data type than uint64 because other places use Size for memory size variables.
How about "usedspace" instead of "consumption"? Because that aligns better with the naming used for MemoryContextCounters's member variables, totalspace and freespace.
(3) patch 0002
+ context->consumption += chunk_size;
(and similar sites)
The used space should include the size of the context-type-specific chunk header, so that the count is closer to the actual memory size seen by the user.
Here, let's make consensus on what the used space represents. Is it either of the following?
a) The total space allocated from OS. i.e., the sum of the malloc()ed regions for a given memory context.
b) The total space of all chunks, including their headers, of a given memory context.
a) is better because that's the actual memory usage from the DBA's standpoint. But a) cannot be used because CacheMemoryContext is used for various things. So we have to compromise on b). Is this OK?
One possible future improvement is to use a separate memory context exclusively for the catcache, which is a child of CacheMemoryContext. That way, we can adopt a).
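That future improvement could look something like this (a sketch
under the assumption that a dedicated child context is acceptable;
the names CatCacheMemoryContext and CreateCatCacheMemoryContext are
mine):

/*
 * A catcache-only child of CacheMemoryContext, so that approach a)
 * (space actually allocated from the OS) could be measured for the
 * catcache alone.
 */
static MemoryContext CatCacheMemoryContext = NULL;

static void
CreateCatCacheMemoryContext(void)
{
    CatCacheMemoryContext = AllocSetContextCreate(CacheMemoryContext,
                                                  "CatCacheMemoryContext",
                                                  ALLOCSET_DEFAULT_SIZES);
}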
(4) patch 0002
@@ -614,6 +614,9 @@ AllocSetReset(MemoryContext context)
+ set->header.consumption = 0;
This can be put in MemoryContextResetOnly() instead of context-type-specific reset functions.
Regards
Takayuki Tsunakawa
Hi Horiguchi-san,
This is the rest of my review comments.
(5) patch 0003
CatcacheClockTimeoutPending = 0;
+
+ /* Update timetamp then set up the next timeout */
+
false is better than 0, to follow other **Pending variables.
timetamp -> timestamp
(6) patch 0003
GetCatCacheClock() is not used now. Why don't we add it when the need arises?
(7) patch 0003
Why don't we remove the catcache timer (Setup/UpdateCatCacheClockTimer), unless we need it by all means? That simplifies the code.
Long-running queries can be thought of as follows:
* A single lengthy SQL statement, e.g. SELECT for reporting/analytics, COPY for data loading, and UPDATE/DELETE for batch processing, should only require a small number of catalog entries during query analysis/planning. It won't suffer from cache eviction during query execution.
* We do not have to evict cache entries while executing a long-running stored procedure, because its constituent SQL statements may access the same tables. If the stored procedure accesses so many tables that you are worried about catcache memory overuse, then catalog_cache_max_size can be used. Another natural idea would be to update the cache clock when SPI executes each SQL statement.
(8) patch 0003
+ uint64 base_size;
+ uint64 base_size = MemoryContextGetConsumption(CacheMemoryContext);
This may as well be Size, not uint64.
(9) patch 0003
@@ -1940,7 +2208,7 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
/*
* Helper routine that copies the keys in the srckeys array into the dstkeys
* one, guaranteeing that the datums are fully allocated in the current memory
- * context.
+ * context. Returns allocated memory size.
*/
static void
CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
@@ -1976,7 +2244,6 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
att->attbyval,
att->attlen);
}
-
}
This change seems to be no longer necessary thanks to the memory accounting.
(10) patch 0004
How about separating this in another thread, so that the rest of the patch set becomes easier to review and commit?
Regarding the design, I'm inclined to avoid each backend writing the file. To simplify the code, I think we can take advantage of the fortunate situation -- the number of backends and catcaches are fixed at server startup. My rough sketch is:
* Allocate an array of statistics entries in shared memory, whose element is (pid or backend id, catcache id or name, hits, misses, ...). The number of array elements is MaxBackends * number of catcaches (some dozens).
* Each backend updates its own entry in the shared memory during query execution.
* Stats collector periodically scans the array and write it to the stats file.
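A rough C rendering of the sketch above (all names here are my
assumptions; MaxBackends and SysCacheSize are the existing constants
for the number of backends and the number of syscaches):

typedef struct CatCacheStatsEntry
{
    int     backend_id;     /* which backend owns this slot */
    int     cache_id;       /* which syscache the numbers are for */
    uint64  hits;
    uint64  misses;
} CatCacheStatsEntry;

/* shared array: MaxBackends * SysCacheSize slots, sized at startup */
static CatCacheStatsEntry *CatCacheStatsArray = NULL;

/* each backend touches only its own slots, so no locking is needed */
static inline void
count_catcache_hit(int backend_id, int cache_id)
{
    CatCacheStatsArray[backend_id * SysCacheSize + cache_id].hits++;
}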
(11) patch 0005
+dlist_head cc_lru_list = {0};
+Size global_size = 0;
It is better to put these in CatCacheHeader. That way, backends that do not access the catcache (archiver, stats collector, etc.) do not have to waste memory for these global variables.
(12) patch 0005
+ else if (catalog_cache_max_size > 0 &&
+ global_size > catalog_cache_max_size * 1024)
CatCacheCleanupOldEntries(cache);
On the second line, catalog_cache_max_size should be cast to Size to avoid overflow.
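That is, something like:

else if (catalog_cache_max_size > 0 &&
         global_size > (Size) catalog_cache_max_size * 1024)
    CatCacheCleanupOldEntries(cache);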
(13) patch 0005
+ gettext_noop("Sets the maximum size of catcache in kilobytes."),
catcache -> catalog cache
(14) patch 0005
+ CatCache *owner; /* owner catcache */
CatCTup already has my_cache member.
(15) patch 0005
if (nremoved > 0)
elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
cp->id, cp->cc_relname, nremoved, nelems_before);
In the prune-by-size case, this elog doesn't show very meaningful data. How about dividing this function into two, one for prune-by-age and another for prune-by-size? I suppose that would make the functions easier to understand.
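A sketch of the suggested split (the names are mine), where each
function would emit its own DEBUG1 message appropriate to its
trigger:

/* age-based: remove entries idle for catalog_cache_prune_min_age or more */
static int CatCacheCleanupByAge(CatCache *cp);

/* size-based: evict LRU entries until global_size fits catalog_cache_max_size */
static int CatCacheCleanupBySize(void);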
Regards
Takayuki Tsunakawa
From: 'Bruce Momjian' [mailto:bruce@momjian.us]
I think, in general, smaller is better, as long as making something
smaller doesn't remove data that is frequently accessed. Having a timer
to expire only old entries seems like it accomplished this goal.Having a minimum size and not taking it to zero size makes sense if we
know we will need certain entries like pg_class in the next query.
However, if the session is idle for hours, we should just probably
remove everything, so maybe the minimum doesn't make sense --- just
remove everything.
That's another interesting idea. A somewhat relevant feature is Oracle's "ALTER SYSTEM FLUSH SHARED_POOL". It flushes all dictionary cache, library cache, and SQL plan entries. The purpose is different: not to release memory, but to defragment the shared memory.
I don't think other DBMSs are a good model since they have a reputation
for requiring a lot of tuning --- tuning that we have often automated.
Yeah, I agree that PostgreSQL is easier to use in many aspects.
On the other hand, although I hesitate to say this (please don't get upset...), I feel PostgreSQL is a bit too loose about memory usage. To my memory, PostgreSQL has crashed the OS due to OOM in our user environments:
* Creating and dropping temp tables repeatedly in a stored PL/pgSQL function. This results in infinite CacheMemoryContext bloat. This is referred to at the beginning of this mail thread.
Oracle and MySQL can limit the size of the dictionary cache.
* Each pair of SAVEPOINT/RELEASE leaves 8KB of CurTransactionContext. The customer used psqlODBC to run a batch app, which ran millions of SQL statements in a transaction. psqlODBC wraps each SQL statement with SAVEPOINT and RELEASE by default.
I guess this is what caused the crash of AWS Aurora on last year's Amazon Prime Day.
* Setting a large value to work_mem, and then run many concurrent large queries.
Oracle can limit the total size of all sessions' memory with PGA_AGGREGATE_TARGET parameter.
We all have to manage things within resource constraints. The DBA wants to make sure the server doesn't overuse memory to avoid crash or slowdown due to swapping. Oracle does it, and another open source database, MySQL, does it too. PostgreSQL does it with shared_buffers, wal_buffers, and work_mem (within a single session). Then, I thought it's natural to do it with catcache/relcache/plancache.
Regards
Takayuki Tsunakawa
On 2/19/19 12:43 AM, Tsunakawa, Takayuki wrote:
Hi Horiguchi-san,
I've looked through your patches. This is the first part of my review results. Let me post the rest after another work today.
BTW, how about merging 0003 and 0005, and separating and deferring 0004 in another thread? That may help to relieve other community members by making this patch set not so large and complex.
[Bottleneck investigation]
Ideriha-san and I are trying to find the bottleneck. My first try shows there's little overhead. Here's what I did:
<postgresql.conf>
shared_buffers = 1GB
catalog_cache_prune_min_age = -1
catalog_cache_max_size = 10MB
<benchmark>
$ pgbench -i -s 10
$ pg_ctl stop and then start
$ cache all data in shared buffers by running pg_prewarm on branches, tellers, accounts, and their indexes
$ pgbench --select-only -c 1 -T 60
<result>
master : 8612 tps
patched: 8553 tps (-0.7%)
There's little (0.7%) performance overhead with:
* one additional dlist_move_tail() in every catcache access
* memory usage accounting in operations other than catcache access (relevant catcache entries should be cached in the first pgbench transaction)
I'll check other patterns to find out how big the overhead is.
0.7% may easily be just a noise, possibly due to differences in layout
of the binary. How many runs? What was the variability of the results
between runs? What hardware was this tested on?
FWIW I doubt tests with such a small schema are proving anything -
the cache/lists are likely tiny. That's why I tested with a much larger
number of relations.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
But at the same time, I did some benchmark with only hard limit option enabled and
time-related option disabled, because the figures of this case are not provided in this
thread.
So let me share it.
I'm sorry, but I'm taking back the results about the patch and correcting them.
I configured postgresql (master) with only CFLAGS=-O2,
but I misconfigured postgres (patch applied) with
--enable-cassert --enable-debug --enable-tap-tests 'CFLAGS=-O0'.
These debug options (especially --enable-cassert) caused enormous overhead.
(I thought I had checked the configure options.. I was maybe tired.)
So I changed these to only 'CFLAGS=-O2' and re-measured them.
I did two experiments. One is to show that negative cache bloat is suppressed.
This thread originated from the issue that the negative cache of pg_statistics
bloats as creating and dropping temp tables is repeatedly executed.
/messages/by-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp
Using the script attached to the first email in this thread, I repeated create and drop
temp table 10000 times.
(experiment is repeated 5 times. catalog_cache_max_size = 500kB.
compared master branch and patch with hard memory limit)
Here are TPS and CacheMemoryContext 'used' memory (total - freespace) calculated
by MemoryContextPrintStats() at 100, 1000, 10000 times of create-and-drop
transaction. The result shows cache bloating is suppressed after exceeding the limit
(at 10000) but tps declines regardless of the limit.
number of tx (create and drop)   | 100    |1000     |10000
-----------------------------------------------------------
used CacheMemoryContext (master) |610296  |2029256  |15909024
used CacheMemoryContext (patch)  |755176  |880552   |880592
-----------------------------------------------------------
TPS (master)                     |414     |407      |399
TPS (patch)                      |242     |225      |220
Correct one:
number of tx (create and drop) | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |414 |407 |399
TPS (patch) |447 |415 |409
The results between master and patch are almost the same.
Another experiment uses Tomas's script posted a while ago. The scenario is to do select
1 from multiple tables randomly (uniform distribution).
(experiment is repeated 5 times. catalog_cache_max_size = 10MB.
compared master branch and patch with only hard memory limit enabled)
Before doing the benchmark, I checked that pruning happened only at 10000 tables using
a debug option. The result shows degradation regardless of before or after pruning.
I personally still need a hard size limitation, but I'm surprised that the difference is so
significant.
number of tables | 100    |1000   |10000
-----------------------------------------------------------
TPS (master)     |10966   |10654  |9099
TPS (patch)      |4491    |2099   |378
Correct one:
number of tables | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |10966 |10654 |9099
TPS (patch) | 11137 (+1%) |10710 (+0%) |772 (-91%)
It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
the results are almost the same as master, but after exceeding the limit (at 10000)
the decline happens.
Regards,
Takeshi Ideriha
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
number of tables | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |10966 |10654 |9099
TPS (patch)      | 11137 (+1%) |10710 (+0%) |772 (-91%)
It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
the results are almost the same as master, but after exceeding the limit (at 10000)
the decline happens.
How many concurrent clients?
Can you show the perf's call graph sampling profiles of both the unpatched and patched version, to confirm that the bottleneck is around catcache eviction and refill?
Regards
Takayuki Tsunakawa
At Thu, 14 Feb 2019 00:40:10 -0800, Andres Freund <andres@anarazel.de> wrote in <20190214084010.bdn6tmba2j7szo3m@alap3.anarazel.de>
Hi,
On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
Instead, I added an accounting(?) interface function.
| MemoryContextGetConsumption(MemoryContext cxt);
The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.
That's definitely *NOT* almost without overhead. This adds additional
instructions to one of postgres' hottest code paths.
I'm not sure how much the two instructions in AllocSetAlloc
actually impact, but I agree that it is doubtful that the
size-limit feature is worth the possible slowdown to any extent.
# I faintly remember that I tried the same thing before..
I think you're not working incrementally enough here. I strongly suggest
solving the negative cache entry problem, and then incrementally go from
there after that's committed. The likelihood of this patch ever getting
merged otherwise seems extremely small.
Mmm. Scoping to the negcache problem, my very first patch posted
two years ago does that based on invalidation for pg_statistic
and pg_class, like I think Tom has suggested somewhere in this
thread.
/messages/by-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp
This is completely different approach from the current shape and
it would be useless after pruning is introduced. So I'd like to
go for the generic pruning by age.
Difference from v15:
Removed the AllocSet accounting stuff. We use approximate memory
size for the catcache.
Removed the prune-by-number (or size) stuff.
Addressed comments from Tsunakawa-san and Ideriha-san.
Separated the catcache monitoring feature. (Removed from this set)
(But it is crucial for checking this feature...)
Is this small enough?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Difference from v15:
Removed the AllocSet accounting stuff. We use approximate memory
size for the catcache.
Removed the prune-by-number (or size) stuff.
Addressed comments from Tsunakawa-san and Ideriha-san.
Separated the catcache monitoring feature. (Removed from this set)
(But it is crucial for checking this feature...)
Is this small enough?
The commit message in 0002 says 'This also can put a hard limit on the
number of catcache entries.' but neither of the GUCs that you've
documented have that effect. Is that a leftover from a previous
version?
I'd like to see some evidence that catalog_cache_memory_target has any
value, vs. just always setting it to zero. I came up with the
following somewhat artificial example that shows that it might have
value.
rhaas=# create table foo (a int primary key, b text) partition by hash (a);
[rhaas pgsql]$ perl -e 'for (0..9999) { print "CREATE TABLE foo$_
PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; }'
| psql
First execution of 'select * from foo' in a brand new session takes
about 1.9 seconds; subsequent executions take about 0.7 seconds. So,
if catalog_cache_memory_target were set to a high enough value to
allow all of that stuff to remain in cache, we could possibly save
about 1.2 seconds coming off the blocks after a long idle period.
That might be enough to justify having the parameter. But I'm not
quite sure how high the value would need to be set to actually get the
benefit in a case like that, or what happens if you set it to a value
that's not quite high enough. I think it might be good to play around
some more with cases like this, just to get a feeling for how much
time you can save in exchange for how much memory.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Tsunakawa, Takayuki
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
number of tables | 100         |1000        |10000
-----------------------------------------------------------
TPS (master)     |10966        |10654       |9099
TPS (patch)      | 11137 (+1%) |10710 (+0%) |772 (-91%)
It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
the results are almost the same as master, but after exceeding the limit (at 10000)
the decline happens.
How many concurrent clients?
One client (default setting).
Can you show the perf's call graph sampling profiles of both the unpatched and
patched version, to confirm that the bottleneck is around catcache eviction and refill?
I checked it with perf record -avg and perf report.
The following shows top 20 symbols during benchmark including kernel space.
The main difference between master (unpatched) and patched seems to be that
the patched one consumes cpu in catcache-evict-and-refill functions including
SearchCatCacheMiss(), CatalogCacheCreateEntry(), and CatCacheCleanupOldEntries().
So it seems to me that these functions need further inspection
to suppress the performance decline as much as possible.
Master(%) master |patch (%) patch
51.25% cpu_startup_entry | 51.45% cpu_startup_entry
51.13% arch_cpu_idle | 51.19% arch_cpu_idle
51.13% default_idle | 51.19% default_idle
51.13% native_safe_halt | 50.95% native_safe_halt
36.27% PostmasterMain | 46.98% PostmasterMain
36.27% main | 46.98% main
36.27% __libc_start_main | 46.98% __libc_start_main
36.07% ServerLoop | 46.93% ServerLoop
35.75% PostgresMain | 46.89% PostgresMain
26.03% exec_simple_query | 45.99% exec_simple_query
26.00% rest_init | 43.40% SearchCatCacheMiss
26.00% start_kernel | 42.80% CatalogCacheCreateEntry
26.00% x86_64_start_reservations | 42.75% CatCacheCleanupOldEntries
26.00% x86_64_start_kernel | 27.04% rest_init
25.26% start_secondary | 27.04% start_kernel
10.25% pg_plan_queries | 27.04% x86_64_start_reservations
10.17% pg_plan_query | 27.04% x86_64_start_kernel
10.16% main | 24.42% start_secondary
10.16% __libc_start_main | 22.35% pg_analyze_and_rewrite
10.03% standard_planner | 22.35% parse_analyze
Regards,
Takeshi Ideriha
From: Ideriha, Takeshi/出利葉 健
I checked it with perf record -avg and perf report.
The following shows top 20 symbols during benchmark including kernel space.
The main difference between master (unpatched) and patched seems to be that
the patched one consumes cpu in catcache-evict-and-refill functions including
SearchCatCacheMiss(), CatalogCacheCreateEntry(), and
CatCacheCleanupOldEntries().
So it seems to me that these functions need further inspection
to suppress the performance decline as much as possible.
Thank you. It's good to see the expected functions, rather than strange behavior. The performance drop is natural, just as when the database cache's hit ratio is low. The remedy available to the user is also the same as for the database cache -- increase the catalog cache.
Regards
Takayuki Tsunakawa
From: Robert Haas [mailto:robertmhaas@gmail.com]
That might be enough to justify having the parameter. But I'm not
quite sure how high the value would need to be set to actually get the
benefit in a case like that, or what happens if you set it to a value
that's not quite high enough. I think it might be good to play around
some more with cases like this, just to get a feeling for how much
time you can save in exchange for how much memory.
Why don't we consider this just like the database cache and other DBMSs' dictionary caches? That is,
* If you want to avoid infinite memory bloat, set an upper limit on the size.
* To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch, although that seems to need modification anyway).
Why do people try to get away from a familiar idea... Am I missing something?
Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would look or not? That is:
* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on feedback
Regards
Takayuki Tsunakawa
On Tue, Feb 19, 2019 at 07:08:14AM +0000, Tsunakawa, Takayuki wrote:
We all have to manage things within resource constraints. The DBA
wants to make sure the server doesn't overuse memory to avoid crash
or slowdown due to swapping. Oracle does it, and another open source
database, MySQL, does it too. PostgreSQL does it with shared_buffers,
wal_buffers, and work_mem (within a single session). Then, I thought
it's natural to do it with catcache/relcache/plancache.
I already addressed these questions in an email from Feb 14:
/messages/by-id/20190214154955.GB19578@momjian.us
I understand the operational needs of limiting resources in some cases,
but there is also the history of OS's using working set to allocate
things, which didn't work too well:
https://en.wikipedia.org/wiki/Working_set
I think we need to address the most pressing problem of unlimited cache size
bloat and then take a holistic look at all memory allocation. If we
are going to address that in a global way, I don't see the relation
cache as the place to start.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
From: Tsunakawa, Takayuki
Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would look or
not? That is:
* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on feedback
Attached is a simpler version based on Horiguchi-san's v15 patch,
which means the cache is pruned by both time and size.
(The cleanup function is still complex, but it gets much simpler.)
Regards,
Takeshi Ideriha
Attachments:
On Thu, Feb 21, 2019 at 1:38 AM Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Why don't we consider this just like the database cache and other DBMSs' dictionary caches? That is,
* If you want to avoid infinite memory bloat, set an upper limit on the size.
* To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch, although that seems to need modification anyway).
Why do people try to get away from a familiar idea... Am I missing something?
I don't understand the idea that we would add something to PostgreSQL
without proving that it has value. Sure, other systems have somewhat
similar systems, and they have knobs to tune them. But, first, we
don't know that those other systems made all the right decisions, and
second, even if they did, that doesn't mean that we'll derive similar
benefits in a system with a completely different code base and many
other internal differences.
You need to demonstrate that each and every GUC you propose to add has
a real, measurable benefit in some plausible scenario. You can't just
argue that other people have something kinda like this so we should
have it too. Or, well, you can argue that, but if you do, then -1
from me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Wed, 20 Feb 2019 13:09:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZXw+SwK_9Tp=wLqZDstW_X+Ant=rd7K+q4zmYONPuL=w@mail.gmail.com>
On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Difference from v15:
Removed AllocSet accounting stuff. We use approximate memory
size for catcache.
Removed prune-by-number (or size) stuff.
Addressing comments from Tsunakawa-san and Ideriha-san.
Separated catcache monitoring feature. (Removed from this set)
(But it is crucial to check this feature...)
Is this small enough?
The commit message in 0002 says 'This also can put a hard limit on the
number of catcache entries.' but neither of the GUCs that you've
documented have that effect. Is that a leftover from a previous
version?
Mmm. Right. Thank you for pointing that out, and sorry about it. I fixed
it, along with another mistake in the commit message, in my repo. It
will appear in the next version.
| Remove entries that haven't been used for a certain time
|
| Catcache entries can be left alone for several reasons. It is not
| desirable that they eat up memory. With this patch, entries that
| haven't been used for a certain time are considered to be removed
| before enlarging hash array.
I'd like to see some evidence that catalog_cache_memory_target has any
value, vs. just always setting it to zero. I came up with the
following somewhat artificial example that shows that it might have
value.
rhaas=# create table foo (a int primary key, b text) partition by hash (a);
[rhaas pgsql]$ perl -e 'for (0..9999) { print "CREATE TABLE foo$_
PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; }'
| psql
First execution of 'select * from foo' in a brand new session takes
about 1.9 seconds; subsequent executions take about 0.7 seconds. So,
if catalog_cache_memory_target were set to a high enough value to
allow all of that stuff to remain in cache, we could possibly save
about 1.2 seconds coming off the blocks after a long idle period.
That might be enough to justify having the parameter. But I'm not
quite sure how high the value would need to be set to actually get the
benefit in a case like that, or what happens if you set it to a value
that's not quite high enough.
It is artificial (or actually won't be repeatedly executed in a
session) but anyway what can get benefit from
catalog_cache_memory_target would be a kind of extreme case.
I think the two parameters are to be tuned in the following
steps.
- If the default setting satisfies you, leave it alone. (as a
general suggestion)
- If you find that your (syscache-sensitive) queries are executed
at rather long intervals, say 10-30 minutes, and they get
slower than at shorter intervals, consider increasing
catalog_cache_prune_min_age to about the query interval. If you
don't suffer process bloat, that's fine.
- If you find the process bloats too much and you (intuitively)
suspect the cause is the system cache, set it to a rather short
value, say 1 minute, and set catalog_cache_memory_target
to the allowable amount of memory for each process. The memory
usage will be stable at some (un)certain amount above the target.
Or, if you want to determine the setting in advance with a rather
strict limit, and if the monitoring feature were a part of this
patchset, a user could check how much memory is used for the query.
$ perl -e 'print "set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' | psql
sum
---------
7088523
In this case, set catalog_cache_memory_target to 7MB and
catalog_cache_prune_min_age to '1min'. Since the target doesn't
work strictly (it is checked only at each resize), you may
need further tuning.
that's not quite high enough. I think it might be good to play around
some more with cases like this, just to get a feeling for how much
time you can save in exchange for how much memory.
All tuning is something of that kind, I think.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 25 Feb 2019 15:23:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190225.152322.104148315.horiguchi.kyotaro@lab.ntt.co.jp>
I think the two parameters are to be tuned in the following
steps.
- If the default setting satisfies you, leave it alone. (as a
general suggestion)
- If you find that your (syscache-sensitive) queries are executed
at rather long intervals, say 10-30 minutes, and they get
slower than at shorter intervals, consider increasing
catalog_cache_prune_min_age to about the query interval. If you
don't suffer process bloat, that's fine.
- If you find the process bloats too much and you (intuitively)
suspect the cause is the system cache, set it to a rather short
value, say 1 minute, and set catalog_cache_memory_target
to the allowable amount of memory for each process. The memory
usage will be stable at some (un)certain amount above the target.
Or, if you want to determine the setting in advance with a rather
strict limit, and if the monitoring feature were a part of this
patchset, a user could check how much memory is used for the query.
$ perl -e 'print "set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' | psql
sum
---------
7088523
It's not substantial, but the number is for
catalog_cache_prune_min_age = 300s; I had 12MB when it is
disabled.
perl -e 'print "set catalog_cache_prune_min_age to 0; set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' | psql
sum
----------
12642321
In this case, set catalog_cache_memory_target to 7MB and
catalog_cache_prune_min_age to '1min'. Since the target doesn't
work strictly (it is checked only at each resize), you may
need further tuning.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Robert Haas [mailto:robertmhaas@gmail.com]
I don't understand the idea that we would add something to PostgreSQL
without proving that it has value. Sure, other systems have somewhat
similar systems, and they have knobs to tune them. But, first, we
don't know that those other systems made all the right decisions, and
second, even they are, that doesn't mean that we'll derive similar
benefits in a system with a completely different code base and many
other internal differences.
I understand that general idea. But I don't see why the proposed approach, eviction based only on elapsed time and only at hash table expansion, is better suited to PostgreSQL's code base and its other internal differences...
You need to demonstrate that each and every GUC you propose to add has
a real, measurable benefit in some plausible scenario. You can't just
argue that other people have something kinda like this so we should
have it too. Or, well, you can argue that, but if you do, then -1
from me.
The benefits of the size limit are:
* Controllable and predictable memory usage. The DBA can be sure that OOM won't happen.
* Smoothed (non-abnormal) transaction response time. This is due to the elimination of bulk eviction of cache entries.
I'm not sure how to tune catalog_cache_prune_min_age and catalog_cache_memory_target. Let me pick up a test scenario in a later mail in response to Horiguchi-san.
Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
- If you find the process too much "bloat"s and you (intuirively)
suspect the cause is system cache, set it to certain shorter
value, say 1 minutes, and set the catalog_cache_memory_target
to allowable amount of memory for each process. The memory
usage will be stable at (un)certain amount above the target.
Could you guide me how to tune these parameters in an example scenario? Let me take the original problematic case referenced at the beginning of this thread. That is:
* A PL/pgSQL function that creates a temp table, accesses it, (accesses other non-temp tables), and drops the temp table.
* An application repeatedly begins a transaction, calls the stored function, and commits the transaction.
With v16 patch applied, and leaving the catalog_cache_xxx parameters set to their defaults, CacheMemoryContext continued to increase as follows:
CacheMemoryContext: 1065016 total in 9 blocks; 104168 free (17 chunks); 960848 used
CacheMemoryContext: 8519736 total in 12 blocks; 3765504 free (19 chunks); 4754232 used
CacheMemoryContext: 25690168 total in 14 blocks; 8372096 free (21 chunks); 17318072 used
CacheMemoryContext: 42991672 total in 16 blocks; 11741024 free (21761 chunks); 31250648 used
How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?
I'm afraid that once the catcache hash table becomes large in a short period, the eviction would happen less frequently, leading to memory bloat.
Regards
Takayuki Tsunakawa
From: Tsunakawa, Takayuki
Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code
would look or not? That is:
* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on
feedback
Attached is a simpler version based on Horiguchi-san's v15 patch, which means
the cache is pruned by both time and size.
(The cleanup function is still complex, but it gets much simpler.)
I don't mean to disregard what Horiguchi-san and others have developed and discussed.
But I refactored the v15 patch again to reduce its complexity,
because it seems to me that one of the reasons for dropping the prune-by-size feature stems from
code complexity.
Another thing is that the memory accounting overhead has been discussed, but
its effect hasn't been measured in this thread. So I'd like to measure it.
Regards,
Takeshi Ideriha
Attachments:
On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?
As Tom has said before and will probably say again, I don't think you
actually want that. We know that PostgreSQL gets roughly 100x slower
with the system caches disabled - try running with
CLOBBER_CACHE_ALWAYS. If you are accessing the same system cache
entries repeatedly in a loop - which is not at all an unlikely
scenario, just run the same query or sequence of queries in a loop -
and if the size of the entries exceeds 10MB, even (perhaps especially) by
just a tiny bit, you are going to see a massive performance hit.
Maybe it won't be 100x because some more-commonly-used entries will
always stay cached, but it's going to be really big, I think.
Now you could say - well it's still better than running out of memory.
However, memory usage is quite unpredictable. It depends on how many
backends are active and how many copies of work_mem and/or
maintenance_work_mem are in use, among other things. I don't think we
can say that just imposing a limit on the size of the system caches is
going to be enough to reliably prevent an out of memory condition
unless the other use of memory on the machine happens to be extremely
stable.
So I think what's going to happen if you try to impose a hard-limit on
the size of the system cache is that you will cause some workloads to
slow down by 3x or more without actually preventing out of memory
conditions. What you need to do is accept that system caches need to
grow as big as they need to grow, and if that causes you to run out of
memory, either buy more memory or reduce the number of concurrent
sessions you allow. It would be fine to instead limit the cache
memory if those cache entries only had a mild effect on performance,
but I don't think that's the case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I'd like to see some evidence that catalog_cache_memory_target has any
value, vs. just always setting it to zero.
It is artificial (or actually won't be repeatedly executed in a
session) but anyway what can get benefit from
catalog_cache_memory_target would be a kind of extreme case.
I agree. So then let's not have it.
We shouldn't add more mechanism here than actually has value. It
seems pretty clear that keeping cache entries that go unused for long
periods can't be that important; even if we need them again
eventually, reloading them every 5 or 10 minutes can't hurt that much.
On the other hand, I think it's also pretty clear that evicting cache
entries that are being used frequently will have disastrous effects on
performance; as I noted in the other email I just sent, consider the
effects of CLOBBER_CACHE_ALWAYS. No reasonable user is going to want
to incur a massive slowdown to save a little bit of memory.
I see that *in theory* there is a value to
catalog_cache_memory_target, because *maybe* there is a workload where
tuning that GUC will lead to better performance at lower memory usage
than any competing proposal. But unless we can actually see an
example of such a workload, which so far I don't, we're adding a knob
that everybody has to think about how to tune when in fact we have no
idea how to tune it or whether it even needs to be tuned. That
doesn't make sense. We have to be able to document the parameters we
have and explain to users how they should be used. And as far as this
parameter is concerned I think we are not at that point.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on
feedback
Attached is a simpler version based on Horiguchi-san's v15 patch,
which means the cache is pruned by both time and size.
(The cleanup function is still complex, but it gets much simpler.)
I don't mean to disregard what Horiguchi-san and others have developed and
discussed.
But I refactored the v15 patch again to reduce its complexity, because it
seems to me that one of the reasons for dropping the prune-by-size feature
stems from code complexity.
Another thing is that the memory accounting overhead has been discussed, but
its effect hasn't been measured in this thread. So I'd like to measure it.
I measured the memory context accounting overhead using Tomas's tool palloc_bench,
which he made a while ago in a similar discussion.
/messages/by-id/53F7E83C.3020304@fuzzy.cz
The tool is a little outdated, so I fixed it, but basically I followed his approach.
Things I did:
- make one MemoryContext
- run palloc() and pfree() of a 32kB area 1,000,000 times
- and measure the elapsed time
The result shows that master is 30 times faster than the patched one.
So, as Andres mentioned upthread, it seems to have overhead.
[master (without v15 patch)]
61.52 ms
60.96 ms
61.40 ms
61.42 ms
61.14 ms
[with v15 patch]
1838.02 ms
1754.84 ms
1755.83 ms
1789.69 ms
1789.44 ms
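For readers who don't want to dig the tool out of the attachments, a minimal sketch of the kind of loop palloc_bench times could look like the following (bench_palloc is a made-up name, and this is only a sketch of the idea, not Tomas's actual tool; note that an assert-enabled build would inflate the numbers, as discussed below):

#include "postgres.h"

#include "portability/instr_time.h"
#include "utils/memutils.h"

/*
 * Hypothetical sketch: time N palloc()/pfree() cycles of a given size in
 * a dedicated memory context, returning elapsed milliseconds.
 */
static double
bench_palloc(Size size, int iterations)
{
	MemoryContext cxt;
	instr_time	start, duration;
	int			i;

	cxt = AllocSetContextCreate(CurrentMemoryContext, "palloc_bench",
								ALLOCSET_DEFAULT_SIZES);

	INSTR_TIME_SET_CURRENT(start);
	for (i = 0; i < iterations; i++)
	{
		void	   *p = MemoryContextAlloc(cxt, size);

		pfree(p);
	}
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);	/* duration -= start */

	MemoryContextDelete(cxt);

	/* e.g. bench_palloc(32 * 1024, 1000000) for the test above */
	return INSTR_TIME_GET_MILLISEC(duration);
}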
Regards,
Takeshi Ideriha
From: Robert Haas [mailto:robertmhaas@gmail.com]
On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?
As Tom has said before and will probably say again, I don't think you actually want that.
We know that PostgreSQL gets roughly 100x slower with the system caches disabled
- try running with CLOBBER_CACHE_ALWAYS. If you are accessing the same system
cache entries repeatedly in a loop - which is not at all an unlikely scenario, just run the
same query or sequence of queries in a loop - and if the size of the entries exceeds
10MB, even (perhaps especially) by just a tiny bit, you are going to see a massive
performance hit.
Maybe it won't be 100x because some more-commonly-used entries will always stay
cached, but it's going to be really big, I think.
Now you could say - well it's still better than running out of memory.
However, memory usage is quite unpredictable. It depends on how many backends
are active and how many copies of work_mem and/or maintenance_work_mem are in
use, among other things. I don't think we can say that just imposing a limit on the
size of the system caches is going to be enough to reliably prevent an out of memory
condition unless the other use of memory on the machine happens to be extremely
stable.
So I think what's going to happen if you try to impose a hard-limit on the size of the
system cache is that you will cause some workloads to slow down by 3x or more
without actually preventing out of memory conditions. What you need to do is accept
that system caches need to grow as big as they need to grow, and if that causes you
to run out of memory, either buy more memory or reduce the number of concurrent
sessions you allow. It would be fine to instead limit the cache memory if those cache
entries only had a mild effect on performance, but I don't think that's the case.
I'm afraid I may be quibbling about it.
What about users who understand the performance drop but don't want to
add memory or decrease concurrency?
I think PostgreSQL has parameters
that most users leave at the default and never mind,
but that a few users want to change.
In this case, as you said, introducing a hard-limit parameter can decrease
performance significantly, so how about adding a detailed caution
to the documentation, like for the planner cost parameters?
Regards,
Takeshi Ideriha
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
I measured the memory context accounting overhead using Tomas's tool
palloc_bench,
which he made a while ago in a similar discussion.
/messages/by-id/53F7E83C.3020304@fuzzy.cz
The tool is a little outdated, so I fixed it, but basically I followed
his approach.
Things I did:
- make one MemoryContext
- run palloc() and pfree() of a 32kB area 1,000,000 times
- and measure the elapsed time
The result shows that master is 30 times faster than the patched one.
So, as Andres mentioned upthread, it seems to have overhead.
[master (without v15 patch)]
61.52 ms
60.96 ms
61.40 ms
61.42 ms
61.14 ms
[with v15 patch]
1838.02 ms
1754.84 ms
1755.83 ms
1789.69 ms
1789.44 ms
I'm afraid the measurement is not correct. First, the older discussion below shows that the accounting overhead is much, much smaller, even with a more complex accounting.
9.5: Better memory accounting, towards memory-bounded HashAgg
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
Second, allocation/free of memory larger than 8 kB calls malloc()/free(). I guess the accounting overhead will be more likely to be hidden under the overhead of malloc() and free(). What we'd like to know is the overhead when malloc() and free() are not called.
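To illustrate that point, here is a toy model of the AllocSet sizing decision (this is not the PostgreSQL code itself; the real logic lives in AllocSetAlloc() in src/backend/utils/mmgr/aset.c, and 8 kB is the default chunk limit):

#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the default allocChunkLimit in aset.c. */
#define TOY_ALLOC_CHUNK_LIMIT (8 * 1024)

/*
 * Requests above the chunk limit get a dedicated malloc()'ed block that
 * is free()'d again at pfree() time; smaller requests are carved out of
 * shared blocks and recycled through per-size freelists without calling
 * malloc() at all.
 */
static bool
toy_uses_dedicated_malloc_block(size_t request)
{
	return request > TOY_ALLOC_CHUNK_LIMIT;
}

/*
 * So in the benchmark above, 800 bytes exercises the freelist path, where
 * the accounting cost is a visible fraction of the work, while 32768
 * bytes pays for a malloc()/free() pair per iteration, which dwarfs it.
 */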
And are you sure you didn't enable assert checking?
Regards
Takayuki Tsunakawa
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
I measured the memory context accounting overhead using Tomas's tool
palloc_bench, which he made a while ago in a similar discussion.
/messages/by-id/53F7E83C.3020304@fuzzy.cz
The tool is a little outdated, so I fixed it, but basically I
followed his approach.
Things I did:
- make one MemoryContext
- run palloc() and pfree() of a 32kB area 1,000,000 times
- and measure the elapsed time
And are you sure you didn't enable assert checking?
Ah, sorry.. I misconfigured it.
I'm afraid the measurement is not correct. First, the older discussion below shows
that the accounting overhead is much, much smaller, even with a more complex
accounting.
Second, allocation/free of memory > 8 KB calls malloc()/free(). I guess the
accounting overhead will be more likely to be hidden under the overhead of malloc()
and free(). What we'd like to know is the overhead when malloc() and free() are not
called.
Here is the average of 50 measurements:
palloc/pfree of 800 bytes, 1,000,000 times, and of 32kB, 1,000,000 times.
I checked with gdb that malloc is not called at size=800.
[Size=800, iter=1,000,000]
Master |15.763
Patched|16.262 (+3%)
[Size=32768, iter=1,000,000]
Master |61.3076
Patched|62.9566 (+2%)
At least compared to the previous HashAgg version, the overhead is smaller.
There is some overhead, but the increase is only 2 or 3%?
Regards,
Takeshi Ideriha
On Wed, Feb 27, 2019 at 3:16 AM Ideriha, Takeshi
<ideriha.takeshi@jp.fujitsu.com> wrote:
I'm afraid I may be quibbling about it.
What about users who understand the performance drop but don't want to
add memory or decrease concurrency?
I think PostgreSQL has parameters
that most users leave at the default and never mind,
but that a few users want to change.
In this case, as you said, introducing a hard-limit parameter can decrease
performance significantly, so how about adding a detailed caution
to the documentation, like for the planner cost parameters?
There's nothing wrong with a parameter that is useful to some people
and harmless to everyone else, but the people who are proposing that
parameter still have to demonstrate that it has those properties.
This email thread is really short on clear demonstrations that X or Y
is useful.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Ideriha, Takeshi/出利葉 健
[Size=800, iter=1,000,000]
Master |15.763
Patched|16.262 (+3%)
[Size=32768, iter=1,000,000]
Master |61.3076
Patched|62.9566 (+2%)
What's the unit, second or millisecond?
Why do the numbers of digits to the right of the decimal point differ?
Is the measurement correct? I'm wondering because the difference is larger in the latter case. Isn't the accounting processing almost the same in both cases?
* former: 16.262 - 15.763 = 0.499
* latter: 62.956 - 61.307 = 1.649
At least compared to the previous HashAgg version, the overhead is smaller.
There is some overhead, but the increase is only 2 or 3%?
I think the overhead is sufficiently small. It may get even smaller with a trivial tweak.
You added the new member usedspace at the end of MemoryContextData. The original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes. So, the new member will be on a separate cache line. Try putting usedspace before the name member.
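To illustrate the suggestion, a stand-alone toy declaration (simplified; the real struct has more members and PostgreSQL's own typedefs):

#include <stddef.h>

/* Toy stand-ins for the real PostgreSQL typedefs. */
typedef int NodeTag;
typedef size_t Size;
typedef struct MemoryContextData *MemoryContext;

typedef struct MemoryContextData
{
	NodeTag		type;			/* offset 0 */
	char		isReset;
	char		allowInCritSection;
	Size		usedspace;		/* moved up: offset 8, sharing the first
								 * 64-byte cache line with the header
								 * members touched on every allocation */
	const void *methods;
	MemoryContext parent;
	MemoryContext firstchild;
	MemoryContext prevchild;
	MemoryContext nextchild;
	const char *name;			/* appending usedspace after this point
								 * would push it past byte 64 */
} MemoryContextData;

_Static_assert(offsetof(MemoryContextData, usedspace) < 64,
			   "usedspace shares the first 64-byte cache line");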
Regards
Takayuki Tsunakawa
Robert> This email thread is really short on clear demonstrations that X or Y
Robert> is useful.
It is useful when the whole database does **not** crash, isn't it?
Case A (==current PostgreSQL mode): syscache grows, then OOMkiller
chimes in, kills the database process, and it leads to the complete
cluster failure (all other PG processes terminate themselves).
Case B (==limit syscache by 10MiB or whatever as Tsunakawa, Takayuki
asks): a single ill-behaved process works a bit slower and/or
consumes more CPU than the other ones. The whole DB is still alive.
I'm quite sure "case B" is much better for the end users and for the
database administrators.
So, +1 to Tsunakawa, Takayuki, it would be so great if there was a way
to limit the memory consumption of a single process (e.g. syscache,
workmem, etc, etc).
Robert> However, memory usage is quite unpredictable. It depends on how many
Robert> backends are active
The number of backends can be limited by ensuring proper limits at the
application connection pool level and/or pgbouncer and/or things like
that.
Robert>how many copies of work_mem and/or
Robert> maintenance_work_mem are in use
There might be other patches to cap the total use of
work_mem/maintenance_work_mem.
Robert>I don't think we
Robert> can say that just imposing a limit on the size of the system caches is
Robert> going to be enough to reliably prevent an out of memory condition
The fewer possibilities there are for OOM, the better. Quite often it is
much better to fail a single SQL statement rather than kill all the DB
processes.
Vladimir
At Tue, 26 Feb 2019 10:55:18 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa2b-LUF9h3wugD9ZA5MP0xyu2kJYHC9L6sdLywNSmhBQ@mail.gmail.com>
On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I'd like to see some evidence that catalog_cache_memory_target has any
value, vs. just always setting it to zero.
It is artificial (or actually won't be repeatedly executed in a
session) but anyway what can get benefit from
catalog_cache_memory_target would be a kind of extreme case.
I agree. So then let's not have it.
Ah... Yeah! I see. Andres' concern was that crucial syscache
entries might be blown away during a long idle time. If that
happens, it's enough to just turn the feature off in almost all
such cases.
Without the feature, we no longer need to count memory usage. That
stuff is moved to the monitoring feature, which is out of the scope
of the current status of this patch.
We shouldn't add more mechanism here than actually has value. It
seems pretty clear that keeping cache entries that go unused for long
periods can't be that important; even if we need them again
eventually, reloading them every 5 or 10 minutes can't hurt that much.
On the other hand, I think it's also pretty clear that evicting cache
entries that are being used frequently will have disastrous effects on
performance; as I noted in the other email I just sent, consider the
effects of CLOBBER_CACHE_ALWAYS. No reasonable user is going to want
to incur a massive slowdown to save a little bit of memory.
I see that *in theory* there is a value to
catalog_cache_memory_target, because *maybe* there is a workload where
tuning that GUC will lead to better performance at lower memory usage
than any competing proposal. But unless we can actually see an
example of such a workload, which so far I don't, we're adding a knob
that everybody has to think about how to tune when in fact we have no
idea how to tune it or whether it even needs to be tuned. That
doesn't make sense. We have to be able to document the parameters we
have and explain to users how they should be used. And as far as this
parameter is concerned I think we are not at that point.
In the attached v18:
catalog_cache_memory_target is removed,
some leftovers of removing the hard-limit feature are removed,
the catcache clock update during a query is separated into 0003, and
0004 (the monitoring part) is attached just to show how it is working.
v18-0001-Add-dlist_move_tail:
Just adds dlist_move_tail
v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
Revised pruning feature.
====
v18-0003-Asynchronous-update-of-catcache-clock:
Separated catcache clock update feature.
v18-0004-Syscache-usage-tracking-feature:
Usage tracking feature.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
[Size=800, iter=1,000,000]
Master |15.763
Patched|16.262 (+3%)
[Size=32768, iter=1,000,000]
Master |61.3076
Patched|62.9566 (+2%)
What's the unit, second or millisecond?
Millisecond.
Why do the numbers of digits to the right of the decimal point differ?
Is the measurement correct? I'm wondering because the difference is larger in the
latter case. Isn't the accounting processing almost the same in both cases?
* former: 16.262 - 15.763 = 0.499
* latter: 62.956 - 61.307 = 1.649
I think the overhead is sufficiently small. It may get even smaller with a trivial tweak.
You added the new member usedspace at the end of MemoryContextData. The
original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes.
So, the new member will be on a separate cache line. Try putting usedspace before
the name member.
OK. I changed the order of the MemoryContextData members to fit usedspace into one cacheline.
I disabled the whole catcache eviction mechanism in the patched one and compared it with master
to check whether the overhead of memory accounting becomes small enough.
The settings are almost the same as in the last email.
But last time the number of trials was 50, so I increased it and tried 5000 times to
calculate the average figure (rounded off to three decimal places).
[Size=800, iter=1,000,000]
Master |15.64 ms
Patched|16.26 ms (+4%)
The difference is 0.62ms
[Size=32768, iter=1,000,000]
Master |61.39 ms
Patched|60.99 ms (-1%)
I guess there is around 2% noise.
But based on this experiment it seems the overhead is small.
There is still some overhead, but it can be hidden under other
operations like malloc().
Does this result show that the hard-limit size option with memory accounting
doesn't harm ordinary users who disable the hard-limit size option?
Regards,
Takeshi Ideriha
Attachments:
Hello.
At Mon, 4 Mar 2019 03:03:51 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F44564E@G01JPEXMBKW04>
Does this result show that the hard-limit size option with memory accounting
doesn't harm ordinary users who disable the hard-limit size option?
Not sure, but 4% seems beyond the noise level. The planner mainly
requests smaller allocation sizes, especially for list
operations. If we implement it for the slab allocator, the
degradation would be more significant.
We *are* suffering from endless bloat of the system cache (and some
other stuff) and there is no way to deal with it. The soft-limit
feature actually eliminates the problem with no degradation and
even accelerates execution in some cases.
Infinite bloat is itself a problem, but if the processes just
need a larger but finite amount of memory, additional memory or
a lower max_connections is enough.
What Andres and Robert suggested is that we need a more convincing
reason for the hard-limit feature than "someone wants it". The
degradation of the crude accounting stuff is not the primary issue
here, I think.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Mar 1, 2019 at 3:33 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
It is artificial (or actually won't be repeatedly executed in a
session) but anyway what can get benefit from
catalog_cache_memory_target would be a kind of extreme case.
I agree. So then let's not have it.
Ah... Yeah! I see. Andres' concern was that crucial syscache
entries might be blown away during a long idle time. If that
happens, it's enough to just turn the feature off in almost all
such cases.
+1.
In the attached v18:
catalog_cache_memory_target is removed,
some leftovers of removing the hard-limit feature are removed,
the catcache clock update during a query is separated into 0003, and
0004 (the monitoring part) is attached just to show how it is working.
v18-0001-Add-dlist_move_tail:
Just adds dlist_move_tail
v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
Revised pruning feature.
OK, so this is getting simpler, but I'm wondering why we need
dlist_move_tail() at all. It is a well-known fact that maintaining
LRU ordering is expensive and it seems to be unnecessary for our
purposes here. Can't CatCacheCleanupOldEntries just use a single-bit
flag on the entry? If the flag is set, clear it. If the flag is
clear, drop the entry. When an entry is used, set the flag. Then,
entries will go away if they are not used between consecutive calls to
CatCacheCleanupOldEntries. Sure, that might be slightly less accurate
in terms of which entries get thrown away, but I bet it makes no real
difference.
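A toy model of that scheme might look like this (all names are made up: today's CatCTup has no such flag, and a real implementation would evict through CatCacheRemoveCTup()):

#include <stdbool.h>
#include <stdlib.h>

typedef struct ToyEntry
{
	struct ToyEntry *next;		/* hash-bucket chain */
	bool		referenced;		/* set on every lookup */
	int			refcount;		/* nonzero while pinned by a caller */
} ToyEntry;

/* Called from the lookup path whenever an entry is returned. */
static inline void
toy_entry_touched(ToyEntry *ct)
{
	ct->referenced = true;
}

/* One sweep over a bucket chain; returns the number of entries evicted. */
static int
toy_sweep_bucket(ToyEntry **bucketp)
{
	ToyEntry  **prevp = bucketp;
	ToyEntry   *ct;
	int			nevicted = 0;

	while ((ct = *prevp) != NULL)
	{
		if (ct->referenced || ct->refcount > 0)
		{
			ct->referenced = false;	/* grant one more interval */
			prevp = &ct->next;
		}
		else
		{
			*prevp = ct->next;		/* unlink... */
			free(ct);				/* ...and evict */
			nevicted++;
		}
	}
	return nevicted;
}

Lookups pay a single store to a flag that is almost certainly already in cache, and each sweep drops exactly the entries that were not referenced since the previous sweep -- the usual second-chance approximation of LRU.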
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
OK, so this is getting simpler, but I'm wondering why we need
dlist_move_tail() at all. It is a well-known fact that maintaining
LRU ordering is expensive and it seems to be unnecessary for our
purposes here.
Yeah ... LRU maintenance was another thing that used to be in the
catcache logic and was thrown out as far too expensive. Your idea
of just using a clock sweep instead seems plausible.
regards, tom lane
On 3/6/19 9:17 PM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, so this is getting simpler, but I'm wondering why we need
dlist_move_tail() at all. It is a well-known fact that maintaining
LRU ordering is expensive and it seems to be unnecessary for our
purposes here.
Yeah ... LRU maintenance was another thing that used to be in the
catcache logic and was thrown out as far too expensive. Your idea
of just using a clock sweep instead seems plausible.
I agree clock sweep might be sufficient, although the benchmarks done in
this thread so far do not suggest the LRU approach is very expensive.
A simple true/false flag, as proposed by Robert, would mean we can only
do the cleanup once per the catalog_cache_prune_min_age interval, so
with the default value (5 minutes) the entries might be between 5 and 10
minutes old. That's probably acceptable, although for higher values the
range gets wider and wider ...
Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?
BTW wasn't one of the cases this thread aimed to improve a session that
accesses a lot of objects in a short period of time? That balloons the
syscache, and while this patch evicts the entries from memory, we never
actually release the memory back (because AllocSet just moves it into
the freelists) and it's unlikely to get swapped out (because other
chunks on those memory pages are likely to be still used). I've proposed
to address that by recreating the context if it gets too bloated, and I
think Alvaro agreed with that. But I haven't seen any further discussion
about that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I agree clock sweep might be sufficient, although the benchmarks done in
this thread so far do not suggest the LRU approach is very expensive.
I'm not sure how thoroughly it's been tested -- has someone
constructed a benchmark that does a lot of syscache lookups and
measured how much slower they get with this new code?
A simple true/false flag, as proposed by Robert, would mean we can only
do the cleanup once per the catalog_cache_prune_min_age interval, so
with the default value (5 minutes) the entries might be between 5 and 10
minutes old. That's probably acceptable, although for higher values the
range gets wider and wider ...
That's true, but I don't know that it matters. I'm not sure there's
much of a use case for raising this parameter to some larger value,
but even if there is, is it really worth complicating the mechanism to
make sure that we throw away entries in a more timely fashion? That's
not going to be cost-free, either in terms of CPU cycles or in terms
of code complexity.
Again, I think our goal should be to add the least mechanism here that
solves the problem. If we can show that a true/false flag makes poor
decisions about which entries to evict and a smarter algorithm does
better, then it's worth considering. However, my bet is that it makes
no meaningful difference.
Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?
Move to tail (although lastaccess would be expensive too if it
involves an extra gettimeofday() call). GCLOCK, like we use for
shared_buffers, is a common approximation of LRU which tends to be a
lot less expensive to implement. We could do that here and it might
work well, but I think the question, again, is whether we really need
it. I think our goal here should just be to jettison cache entries
that are clearly worthless. It's expensive enough to reload cache
entries that any kind of aggressive eviction policy is probably a
loser, and if our goal is just to get rid of the stuff that's clearly
not being used, we don't need to be super-accurate about it.
BTW wasn't one of the cases this thread aimed to improve a session that
accesses a lot of objects in a short period of time? That balloons the
syscache, and while this patch evicts the entries from memory, we never
actually release the memory back (because AllocSet just moves it into
the freelists) and it's unlikely to get swapped out (because other
chunks on those memory pages are likely to be still used). I've proposed
to address that by recreating the context if it gets too bloated, and I
think Alvaro agreed with that. But I haven't seen any further discussion
about that.
That's an interesting point. It seems reasonable to me to just throw
away everything and release all memory if the session has been idle
for a while, but if the session is busy doing stuff, discarding
everything in bulk like that is going to cause latency spikes.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/7/19 3:34 PM, Robert Haas wrote:
On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:I agree clock sweep might be sufficient, although the benchmarks done in
this thread so far do not suggest the LRU approach is very expensive.
I'm not sure how thoroughly it's been tested -- has someone
constructed a benchmark that does a lot of syscache lookups and
measured how much slower they get with this new code?
What I've done on v13 (and I don't think the results would be that
different on the current patch, but I may rerun it if needed) is a test
that creates large number of tables (up to 1M) and then accesses them
randomly. I don't know if it matches what you imagine, but see [1]
/messages/by-id/74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com
I don't think this shows any regression, but perhaps we should do a
microbenchmark isolating the syscache entirely?
A simple true/false flag, as proposed by Robert, would mean we can only
do the cleanup once per the catalog_cache_prune_min_age interval, so
with the default value (5 minutes) the entries might be between 5 and 10
minutes old. That's probably acceptable, although for higher values the
range gets wider and wider ...
That's true, but I don't know that it matters. I'm not sure there's
much of a use case for raising this parameter to some larger value,
but even if there is, is it really worth complicating the mechanism to
make sure that we throw away entries in a more timely fashion? That's
not going to be cost-free, either in terms of CPU cycles or in terms
of code complexity.
True, although it very much depends on how expensive it would be.
Again, I think our goal should be to add the least mechanism here that
solves the problem. If we can show that a true/false flag makes poor
decisions about which entries to evict and a smarter algorithm does
better, then it's worth considering. However, my bet is that it makes
no meaningful difference.
True.
Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?
Move to tail (although lastaccess would be expensive too if it
involves an extra gettimeofday() call). GCLOCK, like we use for
shared_buffers, is a common approximation of LRU which tends to be a
lot less expensive to implement. We could do that here and it might
work well, but I think the question, again, is whether we really need
it. I think our goal here should just be to jettison cache entries
that are clearly worthless. It's expensive enough to reload cache
entries that any kind of aggressive eviction policy is probably a
loser, and if our goal is just to get rid of the stuff that's clearly
not being used, we don't need to be super-accurate about it.
True.
BTW wasn't one of the cases this thread aimed to improve a session that
accesses a lot of objects in a short period of time? That balloons the
syscache, and while this patch evicts the entries from memory, we never
actually release the memory back (because AllocSet just moves it into
the freelists) and it's unlikely to get swapped out (because other
chunks on those memory pages are likely to be still used). I've proposed
to address that by recreating the context if it gets too bloated, and I
think Alvaro agreed with that. But I haven't seen any further discussion
about that.
That's an interesting point. It seems reasonable to me to just throw
away everything and release all memory if the session has been idle
for a while, but if the session is busy doing stuff, discarding
everything in bulk like that is going to cause latency spikes.
What I had in mind is more along these lines:
(a) track number of active syscache entries (increment when adding a new
one, decrement when evicting one)
(b) track peak number of active syscache entries
(c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a
memory context swap, i.e. create a new context, copy active entries over
and destroy the old one
That would at least free() the memory. Of course, the syscache entries
may have different sizes, so tracking just numbers of entries is just an
approximation. But I think it'd be enough.
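A toy sketch of (a)-(c) follows (the names are mine; the actual copy step would be the usual create-new-context/copy/MemoryContextDelete() dance):

typedef struct ToyCacheStats
{
	long		active;			/* currently live entries */
	long		peak;			/* high-water mark since last compaction */
} ToyCacheStats;

#define COMPACT_FACTOR 2		/* the K above */

static inline void
toy_entry_added(ToyCacheStats *st)
{
	if (++st->active > st->peak)
		st->peak = st->active;
}

static inline void
toy_entry_evicted(ToyCacheStats *st)
{
	st->active--;
}

/* Check after each clock sweep; reset peak after compacting. */
static inline bool
toy_needs_compaction(const ToyCacheStats *st)
{
	return st->peak > COMPACT_FACTOR * st->active;
}

If toy_needs_compaction() fires, the cache would create a fresh context, copy the surviving entries into it (fixing up pointers), destroy the bloated context so its blocks are actually free()'d, and set peak = active.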
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?
Move to tail (although lastaccess would be expensive too if it
involves an extra gettimeofday() call).
As I recall, the big problem with the old LRU code was loss of
locality of access, in that in addition to the data associated with
hot syscache entries, you were necessarily also touching list link
fields associated with not-hot entries. That's bad for the CPU cache.
A gettimeofday call (or any other kernel call) per syscache access
would be a complete disaster.
regards, tom lane
On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I don't think this shows any regression, but perhaps we should do a
microbenchmark isolating the syscache entirely?
Well, if we need the LRU list, then yeah I think a microbenchmark
would be a good idea to make sure we really understand what the impact
of that is going to be. But if we don't need it and can just remove
it then we don't.
What I had in mind is more along these lines:
(a) track number of active syscache entries (increment when adding a new
one, decrement when evicting one)
(b) track peak number of active syscache entries
(c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a
memory context swap, i.e. create a new context, copy active entries over
and destroy the old one
That would at least free() the memory. Of course, the syscache entries
may have different sizes, so tracking just numbers of entries is just an
approximation. But I think it'd be enough.
Yeah, that could be done. I'm not sure how expensive it would be, and
I'm also not sure how much more effective it would be than what's
currently proposed in terms of actually freeing memory. If you free
enough dead syscache entries, you might manage to give some memory
back to the OS: after all, there may be some locality there. And even
if you don't, you'll at least prevent further growth, which might be
good enough.
We could consider doing some version of what has been proposed here
and the thing you're proposing here could later be implemented on top
of that. I mean, evicting entries at all is a prerequisite to
copy-and-compact.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/7/19 4:01 PM, Robert Haas wrote:
On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I don't think this shows any regression, but perhaps we should do a
microbenchmark isolating the syscache entirely?
Well, if we need the LRU list, then yeah I think a microbenchmark
would be a good idea to make sure we really understand what the impact
of that is going to be. But if we don't need it and can just remove
it then we don't.
What I had in mind is more along these lines:
(a) track number of active syscache entries (increment when adding a new
one, decrement when evicting one)
(b) track peak number of active syscache entries
(c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a
memory context swap, i.e. create a new context, copy active entries over
and destroy the old one
That would at least free() the memory. Of course, the syscache entries
may have different sizes, so tracking just numbers of entries is just an
approximation. But I think it'd be enough.
Yeah, that could be done. I'm not sure how expensive it would be, and
I'm also not sure how much more effective it would be than what's
currently proposed in terms of actually freeing memory. If you free
enough dead syscache entries, you might manage to give some memory
back to the OS: after all, there may be some locality there. And even
if you don't, you'll at least prevent further growth, which might be
good enough.
I have my doubts about that happening in practice. It might happen for
some workloads, but I think the locality is rather unpredictable.
We could consider doing some version of what has been proposed here
and the thing you're proposing here could later be implemented on top
of that. I mean, evicting entries at all is a prerequisite to
copy-and-compact.
Sure. I'm not saying the patch must do this to make it committable.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Robert Haas [mailto:robertmhaas@gmail.com]
On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
I don't think this shows any regression, but perhaps we should do a
microbenchmark isolating the syscache entirely?
Well, if we need the LRU list, then yeah I think a microbenchmark would be a good idea
to make sure we really understand what the impact of that is going to be. But if we
don't need it and can just remove it then we don't.
Just to be sure: we introduced the LRU list in this thread to find the entries older than the threshold time
without scanning the whole hash table. If the hash table becomes large without the LRU list, the scan becomes slow.
Regards,
Takeshi Ideriha
From: Vladimir Sitnikov [mailto:sitnikov.vladimir@gmail.com]
Robert> This email thread is really short on clear demonstrations that X
Robert> or Y is useful.
It is useful when the whole database does **not** crash, isn't it?
Case A (==current PostgreSQL mode): syscache grows, then OOMkiller chimes in, kills
the database process, and it leads to the complete cluster failure (all other PG
processes terminate themselves).
Case B (==limit syscache by 10MiB or whatever as Tsunakawa, Takayuki
asks): a single ill-behaved process works a bit slower and/or consumes more CPU
than the other ones. The whole DB is still alive.
I'm quite sure "case B" is much better for the end users and for the database
administrators.
So, +1 to Tsunakawa, Takayuki, it would be so great if there was a way to limit the
memory consumption of a single process (e.g. syscache, workmem, etc, etc).
Robert> However, memory usage is quite unpredictable. It depends on how
Robert> many backends are active
The number of backends can be limited by ensuring proper limits at the application
connection pool level and/or pgbouncer and/or things like that.
Robert> how many copies of work_mem and/or maintenance_work_mem are in
Robert> use
There might be other patches to cap the total use of
work_mem/maintenance_work_mem.
Robert> I don't think we
Robert> can say that just imposing a limit on the size of the system
Robert> caches is going to be enough to reliably prevent an out of
Robert> memory condition
The fewer possibilities there are for OOM, the better. Quite often it is much better to fail
a single SQL statement rather than kill all the DB processes.
Yeah, I agree. This limit would be useful for such extreme situations.
Regards,
Takeshi Ideriha
On Thu, Mar 7, 2019 at 11:40 PM Ideriha, Takeshi
<ideriha.takeshi@jp.fujitsu.com> wrote:
Just to be sure: we introduced the LRU list in this thread to find the entries older than the threshold time
without scanning the whole hash table. If the hash table becomes large without the LRU list, the scan becomes slow.
Hmm. So, it's a trade-off, right? One option is to have an LRU list,
which imposes a small overhead on every syscache or catcache operation
to maintain the LRU ordering. The other option is to have no LRU
list, which imposes a larger overhead every time we clean up the
syscaches. My bias is toward thinking that the latter is better,
because:
1. Not everybody is going to use this feature, and
2. Syscache cleanup should be something that only happens every so
many minutes, and probably while the backend is otherwise idle,
whereas lookups can happen many times per millisecond.
However, perhaps someone will provide some evidence that casts a
different light on the situation.
I don't see much point in continuing to review this patch at this
point. There's been no new version of the patch in 3 weeks, and there
is -- in my view at least -- a rather frustrating lack of evidence
that the complexity this patch introduces is actually beneficial. No
matter how many people +1 the idea of making this more complicated, it
can't be justified unless you can provide a test result showing that
the additional complexity solves a problem that does not get solved
without that complexity. And even then, who is going to commit a
patch that uses a design which Tom Lane says was tried before and
stunk?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Mon, 25 Mar 2019 09:28:57 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaViV7gFtAiivfBdBZkumvH3_Gey-4G8PF0KHncQSZ_Jw@mail.gmail.com>
On Thu, Mar 7, 2019 at 11:40 PM Ideriha, Takeshi
<ideriha.takeshi@jp.fujitsu.com> wrote:
Just to be sure: we introduced the LRU list in this thread to find the entries older than the threshold time
without scanning the whole hash table.
Hmm. So, it's a trade-off, right? One option is to have an LRU list,
which imposes a small overhead on every syscache or catcache operation
to maintain the LRU ordering. The other option is to have no LRU
list, which imposes a larger overhead every time we clean up the
syscaches. My bias is toward thinking that the latter is better,
because:
1. Not everybody is going to use this feature, and
2. Syscache cleanup should be something that only happens every so
many minutes, and probably while the backend is otherwise idle,
whereas lookups can happen many times per millisecond.
However, perhaps someone will provide some evidence that casts a
different light on the situation.
That's close to my feeling. When the cache is enlarged, all entries
are copied into a new hash twice the size. If some entries have been
removed, we don't need to duplicate the whole hash; otherwise the
pruning scan is just extra work. We don't run the pruning scan more
frequently than the interval, so it is not a bad deal.
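For reference, the shape of that decision in the patch is roughly the following (a simplified sketch with toy types; in the real code the fill-factor check sits where catcache.c decides to call RehashCatCache(), and CatCacheCleanupOldEntries() is the patch's pruning routine):

#include <stdbool.h>

typedef struct ToyCatCache
{
	int			cc_ntup;		/* # of tuples currently in the cache */
	int			cc_nbuckets;	/* # of hash buckets */
} ToyCatCache;

extern bool CatCacheCleanupOldEntries(ToyCatCache *cp);	/* patch's pruner */
extern void RehashCatCache(ToyCatCache *cp);			/* existing doubler */

/*
 * Called after inserting a new entry: fill factor above 2 normally
 * triggers a rehash into a twice-size bucket array, copying every entry.
 * With the patch, pruning runs first, and a successful prune makes the
 * rehash (and its whole-hash copy) unnecessary.
 */
static void
toy_maybe_resize(ToyCatCache *cache)
{
	if (cache->cc_ntup > cache->cc_nbuckets * 2)
	{
		if (!CatCacheCleanupOldEntries(cache))
			RehashCatCache(cache);
	}
}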
I don't see much point in continuing to review this patch at this
point. There's been no new version of the patch in 3 weeks, and there
is -- in my view at least -- a rather frustrating lack of evidence
that the complexity this patch introduces is actually beneficial. No
matter how many people +1 the idea of making this more complicated, it
can't be justified unless you can provide a test result showing that
the additional complexity solves a problem that does not get solved
without that complexity. And even then, who is going to commit a
patch that uses a design which Tom Lane says was tried before and
stunk?
Hmm. Anyway, it is broken by a recent commit. I'll post a rebased
version and a version reverted to do the full scan. Then I'll take
numbers as far as I can and will show the results... tomorrow.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. Sorry for being a bit late.
At Wed, 27 Mar 2019 17:30:37 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190327.173037.40342566.horiguchi.kyotaro@lab.ntt.co.jp>
I don't see much point in continuing to review this patch at this
point. There's been no new version of the patch in 3 weeks, and there
is -- in my view at least -- a rather frustrating lack of evidence
that the complexity this patch introduces is actually beneficial. No
matter how many people +1 the idea of making this more complicated, it
can't be justified unless you can provide a test result showing that
the additional complexity solves a problem that does not get solved
without that complexity. And even then, who is going to commit a
patch that uses a design which Tom Lane says was tried before and
stunk?

Hmm. Anyway, the patch is broken by a recent commit. I'll post a rebased
version and a version reverted to do whole-hash scanning. Then I'll take
numbers as far as I can and will show the result... tomorrow.
I took performance numbers for master and three versions of the
patch: LRU, full-scan, and modified full-scan. I noticed that a
useless scan can be skipped in the full-scan version, so I added the
last version.
I ran three artificial test cases. The database is created by
gen_tbl.pl. Numbers are the average of the fastest five runs out of
15 successive runs.
Test cases are listed below.
1_0. About 3,000,000 negative entries are created in the pg_statistic
cache by scanning that many distinct columns. It is 3000 tables
* 1001 columns. Pruning scans happen several times during a run
but no entries are removed. This emulates the bloating phase of
the cache. catalog_cache_prune_min_age is default (300s).
(access_tbl1.pl)
1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
times. This emulates the stable cache case without
pruning. catalog_cache_prune_min_age is default (300s).
(access_tbl2.pl)
2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
3_0. Scan over the 3,000,000 entries twice with prune_age set
to 10s. A run takes about 18 seconds on my box, so a fair amount
of old entries are removed. This emulates the stable case with
continuous pruning. (access_tbl3.pl)
3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
The result follows.
| master | LRU | Full |Full-mod|
-----|--------+--------+--------+--------+
1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
3_1 | 26.576 | 27.462 | 26.202 | 26.368 |
The result of 2_0 and 2_1 seems strange, but I show you the
numbers as they are for now.
- Full-scan seems to have the smallest impact when turned off.
- Full-scan-mod seems to perform best in total (assuming the
Full-mod 2_0 number is bogus).
- LRU doesn't seem to outperform full scanning.
For your information, I measured how long pruning takes.
LRU: 318318 out of 2097153 entries in 26ms: 0.08us/entry.
Full-scan: 443443 out of 2097153 entries in 184ms: 0.4us/entry.
LRU is actually fast at removing entries, but the difference seems
to be canceled by the complexity of LRU maintenance.
As my conclusion, we should go with the Full-scan or
Full-scan-mod version. I will conduct a further overnight test and
see which is better.
I attached the test script set. It is used in the following manner.
(start server)
# perl gen_tbl.pl | psql postgres
(stop server)
# sh run.sh 30 > log.txt # 30 is repeat count
# perl process.pl
| master | LRU | Full |Full-mod|
-----|--------+--------+--------+--------+
1_0 | 16.711 | 17.647 | 16.767 | 17.256 |
...
The attached files are as follows.
LRU version patches:
LRU-0001-Add-dlist_move_tail.patch
LRU-0002-Remove-entries-that-haven-t-been-used-for-a-certain-.patch
Full-scan version patch:
FullScan-0001-Remove-entries-that-haven-t-been-used-for-a-certain-.patch
Full-scan-mod version patch:
FullScan-mod-0001-Remove-entries-that-haven-t-been-used-for-a-certain-.patch
Test scripts:
test_script.tar.gz
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190329.172440.199616830.horiguchi.kyotaro@lab.ntt.co.jp>
I ran three artificial test cases. The database is created by
gen_tbl.pl. Numbers are the average of the fastest five runs out of
15 successive runs.

Test cases are listed below.
1_0. About 3,000,000 negative entries are created in the pg_statistic
cache by scanning that many distinct columns. It is 3000 tables
* 1001 columns. Pruning scans happen several times during a run
but no entries are removed. This emulates the bloating phase of
the cache. catalog_cache_prune_min_age is default (300s).
(access_tbl1.pl)
1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
times. This emulates the stable cache case without
pruning. catalog_cache_prune_min_age is default (300s).
(access_tbl2.pl)
2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
3_0. Scan over the 3,000,000 entries twice with prune_age set
to 10s. A run takes about 18 seconds on my box, so a fair amount
of old entries are removed. This emulates the stable case with
continuous pruning. (access_tbl3.pl)
3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.

The result follows.
     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
 1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
 2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
 2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
 3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
 3_1 | 26.576 | 27.462 | 26.202 | 26.368 |

The result of 2_0 and 2_1 seems strange, but I show you the
numbers as they are for now.
- Full-scan seems to have the smallest impact when turned off.
- Full-scan-mod seems to perform best in total (assuming the
Full-mod 2_0 number is bogus).
- LRU doesn't seem to outperform full scanning.
I had another.. unstable.. result.
| master | LRU | Full |Full-mod|
-----|--------+--------+--------+--------+
1_0 | 16.312 | 16.540 | 16.482 | 16.348 |
1_1 | 16.312 | 16.454 | 16.335 | 16.232 |
2_0 | 16.710 | 16.954 | 17.873 | 17.345 |
2_1 | 16.710 | 17.373 | 18.499 | 17.563 |
3_0 | 25.010 | 33.031 | 33.452 | 33.937 |
3_1 | 25.010 | 24.784 | 24.570 | 25.453 |
Normalizing to master's result and rounding to 1%, it looks
like this:
| master | LRU | Full |Full-mod| Test description
-----|--------+--------+--------+--------+-----------------------------------
1_0 | 100 | 101 | 101 | 100 | bloating. pruning enabled.
1_1 | 100 | 101 | 100 | 100 | bloating. pruning disabled.
2_0 | 100 | 101 | 107 | 104 | normal access. pruning enabled.
2_1 | 100 | 104 | 111 | 105 | normal access. pruning disabled.
3_0 | 100 | 132 | 134 | 136 | pruning continuously running.
3_1 | 100 | 99 | 98 | 102 | pruning disabled.
I'm not sure why 2_1 is slower than 2_0, but LRU has the least impact
if the numbers are right.
I will investigate the strange behavior using a profiler.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 01 Apr 2019 11:05:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190401.110532.102998353.horiguchi.kyotaro@lab.ntt.co.jp>
At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190329.172440.199616830.horiguchi.kyotaro@lab.ntt.co.jp>
I ran three artificial test cases. The database is created by
gen_tbl.pl. Numbers are the average of the fastest five runs out of
15 successive runs.

Test cases are listed below.
1_0. About 3,000,000 negative entries are created in the pg_statistic
cache by scanning that many distinct columns. It is 3000 tables
* 1001 columns. Pruning scans happen several times during a run
but no entries are removed. This emulates the bloating phase of
the cache. catalog_cache_prune_min_age is default (300s).
(access_tbl1.pl)
1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
times. This emulates the stable cache case without
pruning. catalog_cache_prune_min_age is default (300s).
(access_tbl2.pl)
2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
3_0. Scan over the 3,000,000 entries twice with prune_age set
to 10s. A run takes about 18 seconds on my box, so a fair amount
of old entries are removed. This emulates the stable case with
continuous pruning. (access_tbl3.pl)
3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
which means pruning is turned off.
..
I had another.. unstable.. result.
dlist_move_head is used every time an entry is accessed. It moves
the accessed element to the head of its bucket, expecting that
subsequent accesses become faster - a kind of LRU maintenance. But
the mean length of a bucket is 2, so dlist_move_head costs more
than following one extra step of a link. So I removed it in the
pruning patch.
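For readers skimming the thread, a minimal self-contained sketch of the
move-to-front operation in question follows (plain C with hand-rolled
links, not PostgreSQL's dlist API; all names here are mine):

  /* move_to_front.c -- build: gcc -O2 move_to_front.c && ./a.out */
  #include <stdio.h>

  typedef struct Node { struct Node *prev, *next; int key; } Node;

  /* Circular list with a sentinel head, like PostgreSQL's dlist. */
  static void
  push_front(Node *head, Node *n)
  {
      n->next = head->next;
      n->prev = head;
      head->next->prev = n;
      head->next = n;
  }

  /* The operation under discussion: unlink the accessed entry and
   * re-insert it at the bucket head.  With a mean chain length of 2,
   * these pointer writes can cost more than the single extra link hop
   * they are meant to save. */
  static void
  move_to_front(Node *head, Node *n)
  {
      if (head->next == n)
          return;                 /* already first */
      n->prev->next = n->next;    /* unlink */
      n->next->prev = n->prev;
      push_front(head, n);        /* relink at head */
  }

  int
  main(void)
  {
      Node head = { &head, &head, -1 };
      Node a = { 0, 0, 1 }, b = { 0, 0, 2 };

      push_front(&head, &a);
      push_front(&head, &b);      /* list is now: b, a */
      move_to_front(&head, &a);   /* list is now: a, b */
      for (Node *p = head.next; p != &head; p = p->next)
          printf("%d\n", p->key);
      return 0;
  }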
I understand I cannot get rid of noise as far as I'm poking the
feature from a client via the communication and SQL layers.
The attached extension surgically exercises
SearchSysCache3(STATRELATTINH) in almost the same pattern as the
benchmarks taken last week. I believe that gives far more reliable
numbers. But still the numbers fluctuate by up to about 10% every
trial, and the difference among the methods is under the
fluctuation. I'm tired.. But this still looks somewhat wrong.
"ratio" in the following table is the percentage relative to master
for the same test. master2 is a version with dlist_move_head removed
from master.
binary | test | count | avg | stddev | ratio
---------+------+-------+---------+--------+--------
master | 1_0 | 5 | 7841.42 | 6.91
master | 2_0 | 5 | 3810.10 | 8.51
master | 3_0 | 5 | 7826.17 | 11.98
master | 1_1 | 5 | 7905.73 | 5.69
master | 2_1 | 5 | 3827.15 | 5.55
master | 3_1 | 5 | 7822.67 | 13.75
---------+------+-------+---------+--------+--------
master2 | 1_0 | 5 | 7538.05 | 16.65 | 96.13
master2 | 2_0 | 5 | 3927.05 | 11.58 | 103.07
master2 | 3_0 | 5 | 7455.47 | 12.03 | 95.26
master2 | 1_1 | 5 | 7485.60 | 9.38 | 94.69
master2 | 2_1 | 5 | 3870.81 | 5.54 | 101.14
master2 | 3_1 | 5 | 7437.35 | 9.91 | 95.74
---------+------+-------+---------+--------+--------
LRU | 1_0 | 5 | 7633.57 | 9.00 | 97.35
LRU | 2_0 | 5 | 4062.43 | 5.90 | 106.62
LRU | 3_0 | 5 | 8340.51 | 6.12 | 106.57
LRU | 1_1 | 5 | 7645.87 | 13.29 | 96.71
LRU | 2_1 | 5 | 4026.60 | 7.56 | 105.21
LRU | 3_1 | 5 | 8400.10 | 19.07 | 107.38
---------+------+-------+---------+--------+--------
Full | 1_0 | 5 | 7481.61 | 6.70 | 95.41
Full | 2_0 | 5 | 4084.46 | 14.50 | 107.20
Full | 3_0 | 5 | 8166.23 | 14.80 | 104.35
Full | 1_1 | 5 | 7447.20 | 10.93 | 94.20
Full | 2_1 | 5 | 4016.88 | 8.53 | 104.96
Full | 3_1 | 5 | 8258.80 | 7.91 | 105.58
---------+------+-------+---------+--------+--------
FullMod | 1_0 | 5 | 7291.80 | 14.03 | 92.99
FullMod | 2_0 | 5 | 4006.36 | 7.64 | 105.15
FullMod | 3_0 | 5 | 8143.60 | 9.26 | 104.06
FullMod | 1_1 | 5 | 7270.66 | 6.24 | 91.97
FullMod | 2_1 | 5 | 3996.20 | 13.00 | 104.42
FullMod | 3_1 | 5 | 8012.55 | 7.09 | 102.43
So "Full (scan) Mod" wins again, or the diffence is under error.
I don't think this level of difference can be a reason to reject
this kind of resource saving mechanism. LRU version doesn't seem
particularly slow but also doesn't seem particularly fast for the
complexity. FullMod version doesn't look differently.
So it seems to me that the simplest "Full" version wins. The
attached is rebsaed version. dlist_move_head(entry) is removed as
mentioned above in that patch.
The third and fourth attachments are the set of scripts I used.
$ perl gen_tbl.pl | psql postgres
$ run.sh > log.txt
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Thu, Apr 4, 2019 at 8:53 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
So it seems to me that the simplest "Full" version wins. The
attached is the rebased version. dlist_move_head(entry) is removed
in that patch, as mentioned above.
1. I really don't think this patch has any business changing the
existing logic. You can't just assume that the dlist_move_head()
operation is unimportant for performance.
2. This patch still seems to add a new LRU list that has to be
maintained. That's fairly puzzling. You seem to have concluded that
the version without the additional LRU wins, but then sent a new copy
of the version with the LRU.
3. I don't think adding an additional call to GetCurrentTimestamp() in
start_xact_command() is likely to be acceptable. There has got to be
a way to set this up so that the number of new
GetCurrentTimestamp() calls is limited to once per N seconds, vs. the
current implementation that could do it many many many times per
second.
4. The code in CatalogCacheCreateEntry seems clearly unacceptable. In
a pathological case where CatCacheCleanupOldEntries removes exactly
one element per cycle, it could be called on every new catcache
allocation.
I think we need to punt this patch to next release. We're not
converging on anything committable very fast.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for the comment.
At Thu, 4 Apr 2019 15:44:35 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZQx7pCcc=VO3WeDQNpco8h6MZN09KjcOMRRu_CrbeoSw@mail.gmail.com>
On Thu, Apr 4, 2019 at 8:53 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
So it seems to me that the simplest "Full" version wins. The
attached is the rebased version. dlist_move_head(entry) is removed
in that patch, as mentioned above.

1. I really don't think this patch has any business changing the
existing logic. You can't just assume that the dlist_move_head()
operation is unimportant for performance.
Ok, it doesn't show a significant performance gain, so I removed it.
2. This patch still seems to add a new LRU list that has to be
maintained. That's fairly puzzling. You seem to have concluded that
the version without the additional LRU wins, but then sent a new copy
of the version with the LRU.
Sorry, I attached the wrong one. The attached is the right one, which
doesn't add the new dlist.
3. I don't think adding an additional call to GetCurrentTimestamp() in
start_xact_command() is likely to be acceptable. There has got to be
a way to set this up so that the number of new
GetCurrentTimestamp() calls is limited to once per N seconds, vs. the
current implementation that could do it many many many times per
second.
GetCurrentTimestamp() is called only once, very early in
the backend's life in InitPostgres, not in
start_xact_command. What I did in that function is just copy
stmtStartTimestamp, not call GetCurrentTimestamp().
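To illustrate (a sketch under my reading of the above, with simplified
types; SetCatCacheClock is the patch's function name, but this body is
illustrative):

  typedef long long TimestampTz;      /* simplified stand-in type */

  /* Coarse clock consulted by every catcache access. */
  static TimestampTz catcacheclock;

  /* Called once per transaction/statement start, where the server has
   * already obtained the current time for other purposes.  Copying the
   * existing value costs one store; no GetCurrentTimestamp() (i.e. no
   * gettimeofday-class call) is added per cache access. */
  static inline void
  SetCatCacheClock(TimestampTz stmt_start_ts)
  {
      catcacheclock = stmt_start_ts;
  }

  /* Every cache access then just reads the preset clock. */
  static inline TimestampTz
  GetCatCacheClock(void)
  {
      return catcacheclock;
  }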
4. The code in CatalogCacheCreateEntry seems clearly unacceptable. In
a pathological case where CatCacheCleanupOldEntries removes exactly
one element per cycle, it could be called on every new catcache
allocation.
It may be a problem if just one entry is created during a period
longer than catalog_cache_prune_min_age and the resize
interval, or if all candidate entries except one are actually in use
at the pruning moment. Is that realistic?
I think we need to punt this patch to next release. We're not
converging on anything committable very fast.
Yeah, maybe right. This patch went silent for several months, several
times, got comments and was modified to take them in for more
than two cycles, and finally got a death sentence (not literally,
actually a postponement) very close to the end of this third cycle. I
anticipate the same will continue in the next cycle.
By the way, I found the reason for the wrong result in the
previous benchmark. The test 3_0/1 needs to update catcacheclock
in the middle of the loop. I'm going to fix it and rerun.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 05 Apr 2019 09:44:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190405.094407.151644324.horiguchi.kyotaro@lab.ntt.co.jp>
By the way, I found the reason for the wrong result in the
previous benchmark. The test 3_0/1 needs to update catcacheclock
in the middle of the loop. I'm going to fix it and rerun.

I found the cause. CatalogCacheFlushCatalog() doesn't shrink the
hash, so no resize happens once it is bloated. I needed another
version of the function that resets cc_bucket to the initial
size.
Using the new debug function, I got better numbers.
I focused on the performance when the feature is disabled. I rechecked
that by adding the patch part-by-part and identified several causes of
the degradation. I did the following:
- Moved SetCatCacheClock() to AtStart_Cache().
- Maybe improved the caller site of CatCacheCleanupOldEntries().
As a result:
binary | test | count | avg | stddev |
--------+------+-------+---------+--------+-------
master | 1_1 | 5 | 7104.90 | 4.40 |
master | 2_1 | 5 | 3759.26 | 4.20 |
master | 3_1 | 5 | 7954.05 | 2.15 |
--------+------+-------+---------+--------+-------
Full | 1_1 | 5 | 7237.20 | 7.98 | 101.87
Full | 2_1 | 5 | 4050.98 | 8.42 | 107.76
Full | 3_1 | 5 | 8192.87 | 3.28 | 103.00
But it still fluctuates by around 5%..
If this level of degradation is still not acceptable, that
means that nothing can be inserted in the code path, and the new
code path should be isolated from the existing code by using an
indirect call.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
Does this result show that the hard-limit size option with memory
accounting doesn't harm usual users who disable the hard-limit option?
Hi,
I've implemented relation cache size limitation with an LRU list and
built-in memory context size accounting.
I'll share some results, coupled with a quick recap of the catcache, so
that we can resume the discussion if needed. Relation cache bloat was
also discussed in this thread, but right now it's pending and the
catcache feature is not settled. Still, a variety of information should
be useful, I believe.
Regarding the catcache, it seems to me that Horiguchi-san's recent posts
show pretty detailed stats, including a comparison of LRU overhead and
a full scan of the hash table. According to the results, the LRU
overhead seems small, but for simplicity this thread goes without LRU.
/messages/by-id/20190404.215255.09756748.horiguchi.kyotaro@lab.ntt.co.jp
When there was a hard limit on the catcache, there was built-in memory
context size accounting machinery. I checked the overhead of memory
accounting: when repeating palloc and pfree of an 800-byte area many
times it was 4% slower, while in the 32768-byte case there seems to be
no overhead.
/messages/by-id/4E72940DA2BF16479384A86D54D0988A6F44564E@G01JPEXMBKW04
Regarding the relcache hard limit (relation_cache_max_size), most of
the architecture is similar to the catcache one, with an LRU list,
except for memory accounting.
Relcaches are managed by an LRU list. To prune the LRU cache, we need
to know the overall relcache size, including objects pointed to by a
relcache entry, such as index info.
So in this patch relcache objects are allocated under
RelCacheMemoryContext, which is a child of CacheMemoryContext. Objects
pointed to by a relcache entry are allocated under child contexts of
RelCacheMemoryContext.
With built-in size accounting, if a memory context is set to collect
its "group (family) size", you can easily get the context size
including children.
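A rough sketch of that layout follows (hedged: RelCacheMemoryContext is
this patch's context, and stock MemoryContextMemAllocated() stands in
here for the built-in group-size accounting described above):

  /* Sketch of the context layout described above.  The accounting call
   * is a stand-in: the patch uses its own built-in "group size"
   * accounting, but MemoryContextMemAllocated() conveys the same
   * parent-plus-children idea. */
  #include "postgres.h"
  #include "utils/memutils.h"

  static MemoryContext RelCacheMemoryContext;   /* the patch's context */

  static void
  InitRelCacheContext(void)
  {
      /* All relcache objects live under one child of CacheMemoryContext;
       * objects pointed to by a relcache entry (index info etc.) live
       * under children of this context. */
      RelCacheMemoryContext =
          AllocSetContextCreate(CacheMemoryContext,
                                "RelCacheMemoryContext",
                                ALLOCSET_DEFAULT_SIZES);
  }

  static bool
  RelCacheOverLimit(Size limit_bytes)
  {
      /* Passing recurse = true sums the whole context family at once. */
      return MemoryContextMemAllocated(RelCacheMemoryContext, true)
          > limit_bytes;
  }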
I ran two experiments:
A) One is pgbench using Tomas's script he posted while ago, which is randomly select 1 from many tables.
/messages/by-id/4E72940DA2BF16479384A86D54D0988A6F426207@G01JPEXMBKW04
B) The other is to check memory context account overhead using the same method.
/messages/by-id/4E72940DA2BF16479384A86D54D0988A6F44564E@G01JPEXMBKW04
A) randomly select 1 from many tables
Results are the average of 5 runs each.
number of tables | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |11105 |10815 |8915
TPS (patch; limit feature off) |11254 (+1%) |11176 (+3%) |9242 (+4%)
TPS (patch: limit on with 1MB) |11317 (+2%) |10491 (-3%) |7380 (-17%)
The results are noisy, but the overhead of LRU and memory accounting
seems small when the relcache limit feature is turned off.
When the limit feature is turned on, TPS drops 17% after exceeding the
limit, which is no surprise.
B) Repeat palloc/pfree
"With group accounting" means that account test context and its child context with built-in accounting using "palloc_bench_family()".
The other one is that using palloc_bench(). Please see palloc_bench.gz.
[Size=32768, iter=1,000,000]
Master | 59.97 ms
Master with group account | 59.57 ms
patched |67.23 ms
patched with family |68.81 ms
It seems that overhead seems large in this patch. So it needs more inspection this area.
regards,
Takeshi Ideriha
Attachments:
Hello,
my_gripe> But it still fluctuates by around 5%..
my_gripe>
my_gripe> If this level of degradation is still not acceptable, that
my_gripe> means that nothing can be inserted in the code path, and the new
my_gripe> code path should be isolated from the existing code by using an
my_gripe> indirect call.
Finally, after some struggling, I think I've managed to measure
the impact on performance precisely and reliably. Starting from
"make distclean" at every build, then removing everything in
$TARGET before installation, makes things stable enough. (I don't
think that's good, but I didn't investigate the cause..)
I measured time/call by directly calling SearchSysCache3() many
times. It showed that the patch causes around 0.1 microseconds of
degradation per call. (The function overall took about 6.9
microseconds on average.)
Next, I counted how many times SearchSysCache is called during
planning with, as an instance, a query on a partitioned table
having 3000 columns and 1000 partitions.
explain analyze select sum(c0000) from test.p;
The planner made 6020608 syscache calls while planning, and the
overall planning time was 8641ms. (Execution time was 48ms.) 6020608
times 0.1 us is 602 ms of degradation, so roughly -7% degradation
in planning time by estimation. The degradation comes from
really only two successive instructions, "ADD/conditional
MOVE (CMOVE)". This fact leads to the conclusion that the existing
code path as-is doesn't have room for any additional code.
So I sought room for at least one branch, and found it (on
gcc 7.3.1/CentOS7/x64). Interestingly, de-inlining
SearchCatCacheInternal gave me a performance gain of about
3%. Further inlining of CatalogCacheComputeHashValue() gave
another gain of about 3%. I could add a branch in
SearchCatCacheInternal within that gain.
I also tried indirect calls, but the degradation overwhelmed the
gain, so I chose branching rather than indirect calls. I didn't
investigate why that happens.
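The inlining knobs involved can be illustrated with a standalone toy
(GCC attributes; this is not the patch code, and the functions are
stand-ins):

  /* inline_demo.c -- build: gcc -O2 inline_demo.c && ./a.out
   * De-inlining the large outer routine while force-inlining the small
   * hot helper is the combination that paid off above. */
  #include <stdio.h>

  __attribute__((always_inline)) static inline unsigned
  hash_key(unsigned k)                /* small hot helper: force-inline */
  {
      k ^= k >> 16;
      k *= 0x85ebca6bu;
      return k ^ (k >> 13);
  }

  __attribute__((noinline)) static unsigned
  cache_lookup(unsigned key)          /* big workhorse: kept out of line */
  {
      return hash_key(key) & 1023;    /* stand-in for the bucket search */
  }

  int
  main(void)
  {
      printf("%u\n", cache_lookup(42));
      return 0;
  }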
The following is the result. The binaries are built with the same
configuration using -O2.
binary means
master : master HEAD.
patched_off : patched, but pruning disabled (catalog_cache_prune_min_age=-1).
patched_on : patched with pruning enabled.
("300s" for 1, "1s" for 2, "0" for 3)
bench:
1: corresponds to catcachebench(1); fetching STATRELATTINH 3000
* 1000 times, generating new cache entries. (Massive cache
creation)
Pruning doesn't happen while running this.
2: catcachebench(2); 60000 times cache access on 1000
STATRELATTINH entries. (Frequent cache reference)
Pruning doesn't happen while running this.
3: catcachebench(3); fetching 1000(tbls) * 3000(cols)
STATRELATTINH entries. The catcache clock advances at an
interval of 100(tbls) * 3000(cols) accesses, and pruning
happens.
While running catcachebench(3) once, pruning happens 28
times; most of the time 202202 entries are removed and
the total number of entries was limited to 524289. (The
systable has 3000 * 1001 = 3003000 tuples.)
iter: Number of iterations. Time (ms) and stddev are calculated over
the iterations.
binary | bench | iter | time ms | stddev
-------------+-------+-------+----------+--------
master | 1 | 10 | 8150.30 | 12.96
master | 2 | 10 | 4002.88 | 16.18
master | 3 | 10 | 9065.06 | 11.46
-------------+-------+-------+----------+--------
patched_off | 1 | 10 | 8090.95 | 9.95
patched_off | 2 | 10 | 3984.67 | 12.33
patched_off | 3 | 10 | 9050.46 | 4.64
-------------+-------+-------+----------+--------
patched_on | 1 | 10 | 8158.95 | 6.29
patched_on | 2 | 10 | 4023.72 | 10.41
patched_on | 3 | 10 | 16532.66 | 18.39
patched_off is slightly faster than master. patched_on is
generally a bit slower. Even though patched_on/3 seems to take too
long a time, the extra time comes from increased catalog table
access in exchange for memory saving. (That is, it is expected
behavior.) I ran it several times and most runs showed the same
tendency.
As a side effect, once the branch is added, the shared syscache being
developed in a neighbouring thread will be able to be inserted as well
without impact on the existing code path.
===
The benchmark script is used as follows:
- Create many (3000, for example) tables in the "test" schema. I
created a partitioned table with 3000 children.
- The tables have many columns, 1000 for me.
- Run the following commands.
=# select catcachebench(0); -- warm up systables.
=# set catalog_cache_prune_min_age = any; -- as required
=# select catcachebench(n); -- 3 >= n >= 1, the number of "bench" above.
The above result is taken with the following query.
=# select 'patched_on', '3' , count(a), avg(a)::numeric(10,2), stddev(a)::numeric(10,2) from (select catcachebench(3) from generate_series(1, 10)) as a(a);
====
The attached patches are:
0001-Adjust-inlining-of-some-functions.patch:
Changes inlining property of two functions,
SearchCatCacheInternal and CatalogCacheComputeHashValue.
0002-Benchmark-extension-and-required-core-change.patch:
Micro-benchmark of SearchSysCache3() and core-side tweaks, which are
outside this patch set from a functionality standpoint. Works for
0001 but not for 0004 or later; 0003 adjusts that.
0003-Adjust-catcachebench-for-later-patches.patch
Adjustment of 0002, benchmark for 0004, the body of this
patchset. Breaks code consistency until 0004 applied.
0004-Catcache-pruning-feature.patch
The feature patch. It intentionally leaves the indentation of an
existing code block in SearchCatCacheInternal unchanged to keep the
patch small. That is adjusted in the next 0005 patch.
0005-Adjust-indentation-of-SearchCatCacheInternal.patch
Adjusts indentation of 0004.
0001+4+5 is the final shape of the patch set and 0002+3 is only
for benchmarking.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
I'd like to throw in some food for discussion on how much SearchSysCacheN
suffers degradation from various choices of how to insert code into
the SearchSysCacheN code path.
I ran the attached run2.sh script, which runs catcachebench2(), which
asks SearchSysCache3() for cached entries (almost) 240000 times per
run. The numbers on each output line are the mean of 3 runs, and the
stddev. Lines are in "time" order and edited to fit here. "gen_tbl.pl
| psql" creates a database for the benchmark. catcachebench2() runs
the shortest path of the three in the attached benchmark program.
(pg_ctl start)
$ perl gen_tbl.pl | psql ...
(pg_ctl stop)
0. Baseline (0001-benchmark.patch, 0002-Base-change.patch)
At first, I made two binaries from the literally identical source. For
the benchmark's sake the source is already modified a bit: specifically,
it has SetCatCacheClock, which is needed by the benchmark but not
actually called in this benchmark.
time(ms)|stddev(ms)
not patched | 7750.42 | 23.83 # 0.6% faster than 7775.23
not patched | 7864.73 | 43.21
not patched | 7866.80 | 106.47
not patched | 7952.06 | 63.14
master | 7775.23 | 35.76
master | 7870.42 | 120.31
master | 7876.76 | 109.04
master | 7963.04 | 9.49
So, it seems to me that we cannot tell anything about differences
below about 80ms (about 1%) now.
1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)
This is the most straightforward way to add an alternative feature.
pattern 1 | 8459.73 | 28.15 # 9% (>> 1%) slower than 7757.58
pattern 1 | 8504.83 | 55.61
pattern 1 | 8541.81 | 41.56
pattern 1 | 8552.20 | 27.99
master | 7757.58 | 22.65
master | 7801.32 | 20.64
master | 7839.57 | 25.28
master | 7925.30 | 38.84
It's so slow that it cannot be used.
2. Making SearchCatCacheInternal be an indirect function.
(CatCache_Pattern_2.patch)
Next, I made the workhorse routine be called indirectly. The "inline"
on the function actually lets the compiler optimize the SearchCatCacheN
routines as described in the comment, but the effect doesn't seem so
large, at least for this case.
pattern 2 | 7976.22 | 46.12 (2.6% slower > 1%)
pattern 2 | 8103.03 | 51.57
pattern 2 | 8144.97 | 68.46
pattern 2 | 8353.10 | 34.89
master | 7768.40 | 56.00
master | 7772.02 | 29.05
master | 7775.05 | 27.69
master | 7830.82 | 13.78
3. Making SearchCatCacheN be indirect functions. (CatCache_Pattern_3.patch)
As far as gcc/linux/x86 goes, SearchSysCacheN is compiled into the
following instructions:
0x0000000000866c20 <+0>: movslq %edi,%rdi
0x0000000000866c23 <+3>: mov 0xd3da40(,%rdi,8),%rdi
0x0000000000866c2b <+11>: jmpq 0x856ee0 <SearchCatCache3>
If we made SearchCatCacheN be indirect functions as the patch, it
changes just one instruction as:
0x0000000000866c50 <+0>: movslq %edi,%rdi
0x0000000000866c53 <+3>: mov 0xd3da60(,%rdi,8),%rdi
0x0000000000866c5b <+11>: jmpq *0x4c0caf(%rip) # 0xd27910 <SearchCatCache3>
pattern 3 | 7836.26 | 48.66 (2% slower > 1%)
pattern 3 | 7963.74 | 67.88
pattern 3 | 7966.65 | 101.07
pattern 3 | 8214.57 | 71.93
master | 7679.74 | 62.20
master | 7756.14 | 77.19
master | 7867.14 | 73.33
master | 7893.97 | 47.67
I expected this to run in almost the same time. I'm not sure if this is
the result of the spectre_v2 mitigation, but I show the status of my
environment below.
# uname -r
4.18.0-80.11.2.el8_0.x86_64
# cat /proc/cpuinfo
...
model name : Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
stepping : 12
microcode : 0xae
bugs : spectre_v1 spectre_v2 spec_store_bypass mds
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: disabled, RSB filling
I am using CentOS 8 and I haven't found a handy (or on-the-fly) way to
disable them..
Attached are:
0001-benchmark.patch : catcache benchmark extension (and core side fix)
0002-Base-change.patch : baseline change in this series of benchmark
CatCache_Pattern_1.patch: naive branching
CatCache_Pattern_2.patch: indirect SearchCatCacheInternal
CatCache_Pattern_3.patch: indirect SearchCatCacheN
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Tue, Nov 19, 2019 at 07:48:10PM +0900, Kyotaro Horiguchi wrote:
I'd like to throw in food for discussion on how much SearchSysCacheN
suffers degradation from some choices on how we can insert a code into
the SearchSysCacheN code path.
Please note that the patch has a warning, causing cfbot-san to
complain:
catcache.c:786:1: error: no previous prototype for
‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
CatalogCacheFlushCatalog2(Oid catId)
^
cc1: all warnings being treated as errors
So this should at least be fixed. For now I have moved it to next CF,
waiting on author.
--
Michael
This is a new, complete, workable patch after a long time of struggling
with benchmarking.
At Tue, 19 Nov 2019 19:48:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I ran the attached run2.sh script, which runs catcachebench2(), which
asks SearchSysCache3() for cached entries (almost) 240000 times per
run. The numbers on each output line are the mean of 3 runs, and the
stddev. Lines are in "time" order and edited to fit here. "gen_tbl.pl
| psql" creates a database for the benchmark. catcachebench2() runs
the shortest path of the three in the attached benchmark program.

(pg_ctl start)
$ perl gen_tbl.pl | psql ...
(pg_ctl stop)
I wonder why I took the average of the times instead of choosing the
fastest one. This benchmark is extremely CPU intensive, so the fastest
run reliably represents the performance.
I changed the benchmark so that it shows the time of the fastest run
(run4.sh). Based on the latest result, I used pattern 3
(SearchSysCacheN indirection, wrongly labeled as 1 in the last
mail) in the latest version.
I took the fastest time among 3 iterations of 5 runs of
both master/patched O2 binaries.
version | min
---------+---------
master | 7986.65
patched | 7984.47 = 'indirect' below
I would say this version doesn't get degraded by indirect calls.
So, I applied the other part of the catcache expiration patch as the
succeeding parts. After that I got a somewhat strange but very stable
result: just adding struct members accelerates the benchmark. The
numbers are the fastest time of 20 runs of the benchmark in 10
iterations.
ms
master 7980.79 # the master with the benchmark extension (0001)
=====
base 7340.96 # add only struct members and a GUC variable. (0002)
indirect 7998.68 # call SearchCatCacheN indirectly (0003)
=====
expire-off 7422.30 # CatCache expiration (0004)
# (catalog_cache_prune_min_age = -1)
expire-on 7861.13 # CatCache expiration (catalog_cache_prune_min_age = 0)
The patch accelerates CatCacheSearch for uncertain reasons. I'm not
sure what makes the difference between about 8000ms and about 7400ms,
though. Building all versions several times and then running the
benchmark gave me results with the same tendency. I'll stop this
work at this point for now and continue later. The following files are
attached.
0001-catcache-benchmark-extension.patch:
benchmark extension used by the benchmarking here. The test tables
are generated using the attached gentbl2.pl. (perl gentbl2.pl | psql)
0002-base_change.patch:
Preliminarily adds some struct members and a GUC variable to see if
they cause any degradation.
0003-Make-CatCacheSearchN-indirect-functions.patch:
Rewrite to change CatCacheSearchN functions to be called indirectly.
0004-CatCache-expiration-feature.patch:
Add CatCache expiration feature.
gentbl2.pl: A script that emits SQL statements to generate test tables.
run4.sh : The test script I used for benchmarking here.
build2.sh : A script I used to build the four types of binaries used here.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hello Kyotaro-san,
I see this patch is stuck in WoA since 2019/12/01, although there's a
new patch version from 2020/01/14. But the patch seems to no longer
apply, at least according to https://commitfest.cputube.org :-( So at
this point the status is actually correct.
Not sure about the appveyor build (it seems to be about jsonb_set_lax),
but on travis it fails like this:
catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
so I'll leave it in WoA for now.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-21, Tomas Vondra wrote:
Not sure about the appveyor build (it seems to be about jsonb_set_lax),
but on travis it fails like this:
catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
Hmm ... travis is running -Werror? That seems overly strict. I think
we shouldn't punt a patch because of that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2020-Jan-21, Tomas Vondra wrote:
Not sure about the appveyor build (it seems to be about jsonb_set_lax),
FWIW, I think I fixed jsonb_set_lax yesterday, so that problem should
be gone the next time the cfbot tries this.
but on travis it fails like this:
catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
Hmm ... travis is running -Werror? That seems overly strict. I think
we shouldn't punt a patch because of that.
Why not? We're not going to allow pushing a patch that throws warnings
on common compilers. Or if that does happen, some committer is going
to have to spend time cleaning it up. Better to clean it up sooner.
(There is, btw, at least one buildfarm animal using -Werror.)
regards, tom lane
Hello.
At Tue, 21 Jan 2020 14:17:53 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2020-Jan-21, Tomas Vondra wrote:
Not sure about the appveyor build (it seems to be about jsonb_set_lax),
FWIW, I think I fixed jsonb_set_lax yesterday, so that problem should
be gone the next time the cfbot tries this.

but on travis it fails like this:
catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]

Hmm ... travis is running -Werror? That seems overly strict. I think
we shouldn't punt a patch because of that.

Why not? We're not going to allow pushing a patch that throws warnings
on common compilers. Or if that does happen, some committer is going
to have to spend time cleaning it up. Better to clean it up sooner.
(There is, btw, at least one buildfarm animal using -Werror.)
Mmm. The cause of the error is a tentative (or crude, or brute)
benchmarking function provided as an extension, which is not actually a
part of the patch and was included for reviewers' convenience.
However, I don't want to make it work on Windows builds. If that is
regarded as a reason for punting, I'll repost a new version without the
benchmark soon.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jan 21, 2020 at 02:17:53PM -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Hmm ... travis is running -Werror? That seems overly strict. I think
we shouldn't punt a patch because of that.

Why not? We're not going to allow pushing a patch that throws warnings
on common compilers. Or if that does happen, some committer is going
to have to spend time cleaning it up. Better to clean it up sooner.
(There is, btw, at least one buildfarm animal using -Werror.)
I agree that it is good to have in Mr Robot. More early detection
means less follow-up cleanup.
--
Michael
At Tue, 21 Jan 2020 17:29:47 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
I see this patch is stuck in WoA since 2019/12/01, although there's a
new patch version from 2020/01/14. But the patch seems to no longer
apply, at least according to https://commitfest.cputube.org :-( So at
this point the status is actually correct.

Not sure about the appveyor build (it seems to be about jsonb_set_lax),
but on travis it fails like this:
catcache.c:820:1: error: no previous prototype for
‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
I changed my mind and attached the benchmark patch as a .txt file,
expecting the checkers not to pick it up as a part of the patchset.
I have been in precise-performance-measurement mode for a long time,
but I think it's settled. I'd like to return to normal mode and
explain this patch.
=== Motive of the patch
The system cache is a mechanism that accelerates access to the system
catalogs. Basically, entries in a cache are removed via the
invalidation mechanism when the corresponding system catalog entry is
removed. On the other hand, the system cache also holds "negative"
entries indicating that an object is nonexistent, which accelerate the
response for nonexistent objects. But negative entries have no chance
of removal.
On a long-lived session that accepts a wide variety of queries on many
objects, the system cache holds cache entries for many objects that
are accessed only once or a few times. Suppose every object is accessed
once per, say, 30 minutes, and the queries don't need to run in a
very short time. Such cache entries are almost useless but occupy a
large amount of memory.
=== Possible solutions
Many caching systems have an expiration mechanism, which removes
"useless" entries to keep the size under a certain limit. The limit is
typically defined by memory usage or expiration time, in a hard or
soft way. Since we don't implement detailed accounting of memory
usage by the cache, for performance reasons, we can use coarse memory
accounting or expiration time. This patch uses expiration time
because it can be determined on a rather clearer basis.
=== Pruning timing
The next point is when to prune cache entries. Apparently it's not
reasonable to do it on every cache access, since pruning takes far
longer than a cache access.
The system cache is implemented on a hash table. When there's no room
for a new cache entry, the table doubles in size and rehashes all
entries. If pruning makes some space for the new entry, rehashing can
be avoided, so this patch tries pruning just before enlarging the hash
table.
A system cache could be shrunk if less than half of its size is
used, but this patch doesn't do that. This is because we cannot predict
whether a system cache that has just shrunk is going to be enlarged
right after, and I don't want to make this patch that complex.
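In sketch form, the insertion path described above looks like this
(CatCacheCleanupOldEntries and RehashCatCache are the names used in
this thread; the body is illustrative, not the actual patch code):

  /* Sketch of "prune just before enlarging". */
  static void
  ConsiderEnlargingCatCache(CatCache *cache)
  {
      /* Bucket chains are still short enough: nothing to do. */
      if (cache->cc_ntup <= cache->cc_nbuckets * 2)
          return;

      /* Try expiration first: if enough old entries go away, the
       * comparatively expensive rehash can be skipped entirely. */
      if (CatCacheCleanupOldEntries(cache))
          return;

      /* No luck: double the bucket array and rehash everything. */
      RehashCatCache(cache);
  }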
=== Performance
The pruning mechanism adds several members to each cache entry and
updates them at access time.
The system cache is very lightweight machinery, so inserting even one
branch visibly affects performance. So in this patch, the new stuff
is isolated from the existing code path using an indirect call. After
trials on some call points that could be made indirect, I found that
SearchCatCache[1-4]() is the only point that doesn't affect
performance. (Please see upthread for details.) That configuration
also allows future implementations of system caches, such as shared
system caches.
The alternative SearchCatCache[1-4] functions get a bit slower because
they maintain an access timestamp and an access counter. In addition,
pruning adds a certain amount of time even if no entries are
pruned off.
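Schematically, the isolation by indirect call looks as follows (a
condensed sketch: the real SearchCatCache1..4 take a CatCache pointer
and Datum keys, and the assign-hook name here is hypothetical):

  typedef struct CatCTup CatCTup;

  static CatCTup *
  SearchCatCache1_plain(unsigned long key)
  {
      /* hot path, identical to today's code */
      return 0;
  }

  static CatCTup *
  SearchCatCache1_expire(unsigned long key)
  {
      /* same lookup, plus naccess/lastaccess maintenance */
      return 0;
  }

  /* Points at the plain variant unless expiration is enabled, so the
   * default path pays one indirect jump and no new branches. */
  static CatCTup *(*SearchCatCache1) (unsigned long key) =
      SearchCatCache1_plain;

  static void
  AssignCatalogCachePruneMinAge(int newval)
  {
      SearchCatCache1 = (newval >= 0) ? SearchCatCache1_expire
                                      : SearchCatCache1_plain;
  }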
=== Pruning criteria
At the pruning time described above, every entry is examined against
the GUC variable catalog_cache_prune_min_age. The pruning mechanism
involves a clock-sweep-like scheme where an entry lives longer if
it has been accessed. An entry whose access counter is zero is pruned
after catalog_cache_prune_min_age. Otherwise the entry survives the
pruning round and its counter is decremented.
The only timestamp used by this machinery is "catcacheclock", which is
updated at every transaction start.
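A sketch of that per-entry test (field names follow the disassembly
shown elsewhere in this thread; the helper itself is illustrative):

  #include <stdbool.h>

  typedef long long TimestampTz;      /* simplified stand-in type */

  typedef struct
  {
      unsigned    naccess;            /* bumped (saturating) on each hit */
      TimestampTz lastaccess;         /* catcacheclock at the last hit */
  } EntrySketch;

  static bool
  prune_one_entry(EntrySketch *ct, TimestampTz clock, long long min_age)
  {
      if (clock - ct->lastaccess < min_age)
          return false;               /* too young to consider at all */

      if (ct->naccess == 0)
          return true;                /* old and unused: remove it */

      ct->naccess--;                  /* clock sweep: one more round */
      return false;
  }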
=== Concise test
The attached test1.pl can be used to replay the syscache bloat caused
by negative entries. Setting $prune_age to -1 turns pruning off,
and you can see that the backend takes more and more memory without
limit as time proceeds. Setting it to 10 or so, the memory size of
the backend process stops rising at a certain amount.
=== The patch
The attached files below are the patch. They have been separated for
benchmarking reasons, but that also seems to make the patch easier to
read, so I left it that way. I lost track of the correct version
number through the long benchmarking period, so I restarted from v1.
- v1-0001-base_change.patch
Adds new members to existing structs and catcacheclock-related code.
- v1-0002-Make-CatCacheSearchN-indirect-functions.patch
Changes SearchCatCacheN functions to be called by indirect calls.
- v1-0003-CatCache-expiration-feature.patch
The core code of the patch.
- catcache-benchmark-extension.patch.txt
The benchmarking extension that was used for benchmarking
upthread. Just for information.
- test1.pl
Test script to make syscache bloat.
The patchset doesn't contain documentation for the new GUC option. I
will add it later.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Wed, Jan 22, 2020 at 02:38:19PM +0900, Kyotaro Horiguchi wrote:
I changed my mind and attached the benchmark patch as a .txt file,
expecting the checkers not to pick it up as a part of the patchset.

I have been in precise-performance-measurement mode for a long time,
but I think it's settled. I'd like to return to normal mode and
explain this patch.
Looking at the CF bot, this patch set does not compile properly.
Could you look at that?
--
Michael
At Thu, 1 Oct 2020 13:37:29 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Wed, Jan 22, 2020 at 02:38:19PM +0900, Kyotaro Horiguchi wrote:
I changed my mind and attached the benchmark patch as a .txt file,
expecting the checkers not to pick it up as a part of the patchset.

I have been in precise-performance-measurement mode for a long time,
but I think it's settled. I'd like to return to normal mode and
explain this patch.

Looking at the CF bot, this patch set does not compile properly.
Could you look at that?
It is complaining that TimestampDifference is used implicitly. I'm not
sure of the exact cause, but maybe some refactoring of header file
inclusions caused that.
This is the rebased version.
Thanks!
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hello.
The attached is a version that is compacted from the previous
version.
At Thu, 01 Oct 2020 16:47:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
This is the rebased version.
It suddenly occurred to me that constant parameters to inline functions
enable optimization. Up to the previous version I had split
SearchCatCacheInternal() into two almost-identical functions,
SearchCatCacheInternalb() and -e(), but this time I merged the two
functions back into the original function and added a parameter
"do_expire".
Then I compared the machine code of the two derived functions,
SearchCatCache1b and SearchCatCache1e, and confirmed that the
expiration-related code is eliminated from the former.
The lines prefixed by '+' are the instructions corresponding to the
following C code; they are eliminated in SearchCatCache1b.
0000000000002770 <SearchCatCache1e>:
2770: push %r15
2772: push %r14
...
2849: mov %rbp,(%rbx)
284c: mov %rbx,(%rax)
284f: mov %rbx,0x8(%rbp)
+ 2853: mov 0x30(%rbx),%eax # %eax = ct->naccess
+ 2856: mov $0x2,%edx
+ 285b: add $0x1,%eax # ct->access++
+ 285e: cmove %edx,%eax # if(ct->access == 0) %eax = 2
2861: xor %ebp,%ebp
2863: cmpb $0x0,0x15(%rbx) # (if (!ct->negative))
+ 2867: mov %eax,0x30(%rbx) # ct->access = %eax
+ 286a: mov 0x0(%rip),%rax # %rax = catcacheclock
+ 2871: mov %rax,0x38(%rbx) # ct->lastaccess = %rax
2875: jne 289a <SearchCatCache1e+0x12a>
2877: mov 0x0(%rip),%rdi
if (do_expire)
{
ct->naccess++;
if (unlikely(ct->naccess == 0))
ct->naccess = 2;
ct->lastaccess = catcacheclock;
}
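The same elimination can be reproduced in a standalone toy (illustration
only; the functions are stand-ins):

  /* const_param_demo.c -- build: gcc -O2 -S const_param_demo.c
   * Because do_expire is a compile-time constant at each call site,
   * the compiler emits two specialized copies and drops the counter
   * update from the "false" one -- the effect verified in the
   * disassembly above. */
  static unsigned counter;

  static inline unsigned
  search_internal(unsigned key, int do_expire)
  {
      if (do_expire)
          counter++;                  /* stands in for the naccess update */
      return key * 2654435761u;       /* stands in for the actual lookup */
  }

  unsigned
  search_plain(unsigned key)
  {
      return search_internal(key, 0); /* expiration code eliminated here */
  }

  unsigned
  search_expire(unsigned key)
  {
      return search_internal(key, 1); /* expiration code retained here */
  }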
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 19/11/2019 12:48, Kyotaro Horiguchi wrote:
1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)
This is the most straightforward way to add an alternative feature.
pattern 1 | 8459.73 | 28.15 # 9% (>> 1%) slower than 7757.58
pattern 1 | 8504.83 | 55.61
pattern 1 | 8541.81 | 41.56
pattern 1 | 8552.20 | 27.99
master | 7757.58 | 22.65
master | 7801.32 | 20.64
master | 7839.57 | 25.28
master | 7925.30 | 38.84

It's so slow that it cannot be used.
This is very surprising. A branch that's never taken ought to be
predicted by the CPU's branch-predictor, and be very cheap.
Do we actually need a branch there? If I understand correctly, the point
is to bump up a usage counter on the catcache entry. You could increment
the counter unconditionally, even if the feature is not used, and avoid
the branch that way.
Another thought is to bump up the usage counter in ReleaseCatCache(),
and only when the refcount reaches zero. That might be somewhat cheaper,
if it's a common pattern to acquire additional leases on an entry that's
already referenced.
Yet another thought is to replace 'refcount' with an 'acquirecount' and
'releasecount'. In SearchCatCacheInternal(), increment acquirecount, and
in ReleaseCatCache, increment releasecount. When they are equal, the
entry is not in use. Now you have a counter that gets incremented on
every access, with the same number of CPU instructions in the hot paths
as we have today.
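In code form, that idea reads roughly like this (a sketch of the
suggestion, not existing code):

  /* acquirecount doubles as the usage counter; equality of the two
   * counters replaces the refcount == 0 test. */
  typedef struct
  {
      unsigned    acquirecount;   /* bumped in SearchCatCacheInternal */
      unsigned    releasecount;   /* bumped in ReleaseCatCache */
  } CountsSketch;

  static inline void
  acquire_entry(CountsSketch *ct)
  {
      ct->acquirecount++;         /* the only hot-path instruction added */
  }

  static inline void
  release_entry(CountsSketch *ct)
  {
      ct->releasecount++;
  }

  static inline int
  entry_in_use(const CountsSketch *ct)
  {
      /* Unsigned wraparound keeps the comparison valid. */
      return ct->acquirecount != ct->releasecount;
  }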
Or maybe there are some other ways we could micro-optimize
SearchCatCacheInternal(), to buy back the slowdown that this feature
would add? For example, you could remove the "if (cl->dead) continue;"
check, if dead entries were kept out of the hash buckets. Or maybe the
catctup struct could be made slightly smaller somehow, so that it would
fit more comfortably in a single cache line.
My point is that I don't think we want to complicate the code much for
this. All the indirection stuff seems over-engineered for this. Let's
find a way to keep it simple.
- Heikki
Thank you for the comment!
First off, I thought that I had managed to eliminate the degradation
observed in the previous versions, but significant degradation (1.1%
slower) is still seen in one case.
Anyway, before sending the new patch, let me just answer the
comments.
At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 19/11/2019 12:48, Kyotaro Horiguchi wrote:
1. Inserting a branch in
SearchCatCacheInternal. (CatCache_Pattern_1.patch)
This is the most straightforward way to add an alternative feature.
pattern 1 | 8459.73 | 28.15 # 9% (>> 1%) slower than 7757.58
pattern 1 | 8504.83 | 55.61
pattern 1 | 8541.81 | 41.56
pattern 1 | 8552.20 | 27.99
master | 7757.58 | 22.65
master | 7801.32 | 20.64
master | 7839.57 | 25.28
master | 7925.30 | 38.84
It's so slow that it cannot be used.
This is very surprising. A branch that's never taken ought to be
predicted by the CPU's branch-predictor, and be very cheap.
(A) original test patch
I naively thought that the code path was too short for the degradation
of a few additional instructions to be buried. Actually I measured
performance again with the same patch set on the current master and
got more or less the same result.
master 8195.58ms, patched 8817.40 ms: +10.75%
However, I noticed that the additional call was a recursive call, and a
jmp inserted for the recursive call seems to take significant
time. After avoiding the recursive call, the difference shrank to
+0.96% (master 8268.71ms : patched 8348.30ms).
Just the two instructions below are inserted in this case, which looks
reasonable.
8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48 <catalog_cache_prune_min_age>
872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function)
(C) inserting bare counter-update code without a branch
Do we actually need a branch there? If I understand correctly, the
point is to bump up a usage counter on the catcache entry. You could
increment the counter unconditionally, even if the feature is not
used, and avoid the branch that way.
That change causes 4.9% degradation, which is worse than having a
branch.
master 8364.54ms, patched 8666.86ms (+4.9%)
The additional instructions follow.
+ 8721ab <+203>: mov 0x30(%rbx),%eax # %eax = ct->naccess
+ 8721ae <+206>: mov $0x2,%edx
+ 8721b3 <+211>: add $0x1,%eax # %eax++
+ 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2
<original code>
+ 8721bf <+223>: mov %eax,0x30(%rbx) # ct->naccess = %eax
+ 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock>
+ 8721c9 <+233>: mov %rax,0x38(%rbx) # ct->lastaccess = %rax
(D) naively branching then updating, again
Come to think of it, I measured the same thing with a branch again,
specifically: (it showed significant degradation before, as I
remember.)
dlist_move_head(bucket, &ct->cache_elem);
+ if (catalog_cache_prune_min_age < -1) # never be true
+ {
+ (counter update)
+ }
And I got effectively the same numbers from both master and patched.
master 8066.93ms, patched 8052.37ms (-0.18%)
The above branching inserts the same two instructions as (B) in a
different place, but the result differs, for a reason unclear to me.
+ 8721bb <+203>: cmpl $0xffffffff,0x4bb886(%rip) # <catalog_cache_prune_min_age>
+ 8721c2 <+210>: jl 0x872208 <SearchCatCache1+280>
I'm not sure why, but the patched version beats master by a small
margin. Anyway, this new result suggests the compiler might have gotten
smarter than before?
(E) bumping up in ReleaseCatCache() (won't work)
Another thought is to bump up the usage counter in ReleaseCatCache(),
and only when the refcount reaches zero. That might be somewhat
cheaper, if it's a common pattern to acquire additional leases on an
entry that's already referenced.

Yet another thought is to replace 'refcount' with an 'acquirecount'
and 'releasecount'. In SearchCatCacheInternal(), increment
acquirecount, and in ReleaseCatCache, increment releasecount. When
they are equal, the entry is not in use. Now you have a counter that
gets incremented on every access, with the same number of CPU
instructions in the hot paths as we have today.
These don't work for negative caches, since the corresponding tuples
are never released.
(F) removing less-significant code
Or maybe there are some other ways we could micro-optimize
SearchCatCacheInternal(), to buy back the slowdown that this feature
Yeah, I thought of that at the beginning. (I removed dlist_move_head()
at that time.) But the most difficult aspect of this approach is that
I cannot tell whether the modification never causes degradation.
would add? For example, you could remove the "if (cl->dead) continue;"
check, if dead entries were kept out of the hash buckets. Or maybe the
catctup struct could be made slightly smaller somehow, so that it
would fit more comfortably in a single cache line.
As a trial, I removed that code and added the ct->naccess code.
master 8187.44ms, patched 8266.74ms (+1.0%)
So the removal reduced the degradation by about 3.9% of the total
time.
My point is that I don't think we want to complicate the code much for
this. All the indirection stuff seems over-engineered for this. Let's
find a way to keep it simple.
Yes, agreed from the bottom of my heart. I aspire to find a simple way
to avoid degradation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
me> First off, I thought that I had managed to eliminate the degradation
me> observed in the previous versions, but significant degradation (1.1%
me> slower) is still seen in one case.
While benchmarking with many patterns, I noticed that calling
CatCacheCleanupOldEntries() slows down the catcache search significantly
even if the function does almost nothing. Oddly enough, the
degradation got larger when I removed the counter-updating code from
SearchCatCacheInternal. It seems that RehashCatCache is called far
more frequently than I thought, and CatCacheCleanupOldEntries was
suffering the branch penalty.
The degradation vanished when a likely() was attached to the condition.
On the contrary, the patched version is now consistently slightly faster
than master.
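Schematically, the fixed guard looks like this (a sketch; the exact
condition in the patch may differ, and likely()/unlikely() are
PostgreSQL's __builtin_expect wrappers):

  /* In the enlargement path: pruning disabled is the common case, so
   * hint the compiler to lay it out as the fall-through path. */
  if (unlikely(catalog_cache_prune_min_age >= 0) &&
      CatCacheCleanupOldEntries(cp))
      return;                 /* pruning made room; skip the rehash */

  RehashCatCache(cp);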
For now, I measured the patch with the three access patterns that
catcachebench was designed for.
master patched-off patched-on(300s)
test 1 3898.18ms 3896.11ms (-0.1%) 3889.44ms (- 0.2%)
test 2 8013.37ms 8098.51ms (+1.1%) 8640.63ms (+ 7.8%)
test 3 6146.95ms 6147.91ms (+0.0%) 15466 ms (+152 %)
master : This patch is not applied.
patched-off: This patch is applied and catalog_cache_prune_min_age = -1
patched-on : This patch is applied and catalog_cache_prune_min_age = 0
test 1: Creates many negative entries in STATRELATTINH
(expiration doesn't happen)
test 2: Repeat fetch several negative entries for many times.
test 3: test 1 with expiration happens.
The result looks far better, but test 2 still shows a small
degradation... I'll continue investigating it..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 06/11/2020 10:24, Kyotaro Horiguchi wrote:
Thank you for the comment!
First off, I thought that I managed to eliminate the degradation
observed on the previous versions, but significant degradation (1.1%
slower) is still seen in on case.
One thing to keep in mind with micro-benchmarks like this is that even
completely unrelated code changes can change the layout of the code in
memory, which in turn can affect CPU caching effects in surprising ways.
If you're lucky, you can see 1-5% differences just by adding a function
that's never called, for example, if it happens to move other code in
memory so that some hot codepath or struct gets split across CPU cache
lines. It can be infuriating when benchmarking.
At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
(A) original test patch

I naively thought that the code path was too short for the degradation
of a few additional instructions to be buried. Actually I measured
performance again with the same patch set on the current master and
got more or less the same result.

master 8195.58ms, patched 8817.40 ms: +10.75%

However, I noticed that the additional call was a recursive call, and a
jmp inserted for the recursive call seems to take significant
time. After avoiding the recursive call, the difference shrank to
+0.96% (master 8268.71ms : patched 8348.30ms).

Just the two instructions below are inserted in this case, which looks
reasonable.

8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48 <catalog_cache_prune_min_age>
872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function)
That's interesting. I think a 1% degradation would be acceptable.
I think we'd like to enable this feature by default though, so the
performance when it's enabled is also very important.
(C) inserting bare counter-update code without a branch
Do we actually need a branch there? If I understand correctly, the
point is to bump up a usage counter on the catcache entry. You could
increment the counter unconditionally, even if the feature is not
used, and avoid the branch that way.
That change causes 4.9% degradation, which is worse than having a
branch.
master 8364.54ms, patched 8666.86ms (+4.9%)
The additional instructions follow.
+ 8721ab <+203>: mov 0x30(%rbx),%eax # %eax = ct->naccess
+ 8721ae <+206>: mov $0x2,%edx
+ 8721b3 <+211>: add $0x1,%eax # %eax++
+ 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2
<original code>
+ 8721bf <+223>: mov %eax,0x30(%rbx) # ct->naccess = %eax
+ 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock>
+ 8721c9 <+233>: mov %rax,0x38(%rbx) # ct->lastaccess = %rax
Do you need the "ntaccess == 2" test? You could always increment the
counter, and in the code that uses ntaccess to decide what to evict,
treat all values >= 2 the same.
Need to handle integer overflow somehow. Or maybe not: integer overflow
is so infrequent that even if a hot syscache entry gets evicted
prematurely because its ntaccess count wrapped around to 0, it will
happen so rarely that it won't make any difference in practice.
- Heikki
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 06/11/2020 10:24, Kyotaro Horiguchi wrote:
Thank you for the comment!
First off, I thought that I managed to eliminate the degradation
observed on the previous versions, but significant degradation (1.1%
slower) is still seen in one case.
One thing to keep in mind with micro-benchmarks like this is that even
completely unrelated code changes can change the layout of the code in
memory, which in turn can affect CPU caching effects in surprising
ways. If you're lucky, you can see 1-5% differences just by adding a
function that's never called, for example, if it happens to move other
code in memory so that some hot codepath or struct gets split across
CPU cache lines. It can be infuriating when benchmarking.
True. I sometimes had to run make distclean to stabilize such benchmarks.
At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas
<hlinnaka@iki.fi> wrote in
(A) original test patch
I naively thought that the code path is too short to bury the
degradation of a few additional instructions. Actually I measured
performance again with the same patch set on the current master and
had more or less the same result.
master 8195.58ms, patched 8817.40 ms: +10.75%
However, I noticed that the additional call was a recursive call, and
the jmp inserted for the recursive call seems to take significant
time. After avoiding the recursive call, the difference reduced to
+0.96% (master 8268.71ms : patched 8348.30ms)
Just the two instructions below are inserted in this case, which looks
reasonable.
8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48
<catalog_cache_prune_min_age>
872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function)
That's interesting. I think a 1% degradation would be acceptable.
I think we'd like to enable this feature by default though, so the
performance when it's enabled is also very important.
(C) inserting bare counter-update code without a branch
Do we actually need a branch there? If I understand correctly, the
point is to bump up a usage counter on the catcache entry. You could
increment the counter unconditionally, even if the feature is not
used, and avoid the branch that way.
That change causes 4.9% degradation, which is worse than having a
branch.
master 8364.54ms, patched 8666.86ms (+4.9%)
The additional instructions follow.
+ 8721ab <+203>: mov 0x30(%rbx),%eax # %eax = ct->naccess
+ 8721ae <+206>: mov $0x2,%edx
+ 8721b3 <+211>: add $0x1,%eax # %eax++
+ 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2
<original code>
+ 8721bf <+223>: mov %eax,0x30(%rbx) # ct->naccess = %eax
+ 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock>
+ 8721c9 <+233>: mov %rax,0x38(%rbx) # ct->lastaccess = %rax
Do you need the "ntaccess == 2" test? You could always increment the
counter, and in the code that uses ntaccess to decide what to evict,
treat all values >= 2 the same.
Need to handle integer overflow somehow. Or maybe not: integer
overflow is so infrequent that even if a hot syscache entry gets
evicted prematurely because its ntaccess count wrapped around to 0, it
will happen so rarely that it won't make any difference in practice.
Agreed. OK, I had prioritized completely avoiding degradation on the
normal path, but relaxing that restriction to 1% or so makes the code
far simpler and makes the expiration path significantly faster.
Now the branch for the counter increment is removed. For the similar
branches on the counter-decrement side in CatCacheCleanupOldEntries(),
Min() is compiled into cmovbe, so that branch was removed as well.
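For illustration, the clamp in question is a single line (a sketch;
naccess is the patch's field, and Min() is PostgreSQL's ordinary
macro):

/*
 * Clamp the access counter without a data-dependent branch;
 * gcc lowers this Min() to cmovbe on x86-64.
 */
ct->naccess = Min(2, ct->naccess);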
At Mon, 09 Nov 2020 11:13:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Now the branch for the counter increment is removed. For the similar
branches on the counter-decrement side in CatCacheCleanupOldEntries(),
Min() is compiled into cmovbe, so that branch was removed as well.
Mmm. Sorry, I sent this by mistake. Please ignore it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
Do you need the "ntaccess == 2" test? You could always increment the
counter, and in the code that uses ntaccess to decide what to evict,
treat all values >= 2 the same.
Need to handle integer overflow somehow. Or maybe not: integer
overflow is so infrequent that even if a hot syscache entry gets
evicted prematurely because its ntaccess count wrapped around to 0, it
will happen so rarely that it won't make any difference in practice.
Relaxing that simplifies the code significantly, but a degradation of
about 5% still remains.
(SearchCatCacheInternal())
+ ct->naccess++;
!+ ct->lastaccess = catcacheclock;
If I remove the second line above, the degradation disappears
(-0.7%). However, I can't find the corresponding numbers in the output
of perf. The sum of the numbers for the removed instructions is (0.02
+ 0.28 = 0.3%). I know that whole-run degradation doesn't always show
up in the instruction-level profile, but I'm stuck here, anyway.
% samples
master p2 patched (p2 = patched - "ct->lastaccess = catcacheclock")
=============================================================================
0.47 | 0.27 | 0.17 | mov %rbx,0x8(%rbp)
| | | SearchCatCacheInternal():
| | | ct->naccess++;
| | | ct->lastaccess = catcacheclock;
----- |----- | 0.02 |10f: mov catcacheclock,%rax
| | | ct->naccess++;
----- | 0.96 | 1.00 | addl $0x1,0x14(%rbx)
| | | return NULL;
----- | 0.11 | 0.16 | xor %ebp,%ebp
| | | if (!ct->negative)
0.27 | 0.30 | 0.03 | cmpb $0x0,0x21(%rbx)
| | | ct->lastaccess = catcacheclock;
----- | ---- | 0.28 | mov %rax,0x18(%rbx)
| | | if (!ct->negative)
0.34 | 0.08 | 0.59 | ↓ jne 149
For your information, the same table for a bit wider range follows.
% samples
master p2 patched (p2 = patched - "ct->lastaccess = catcacheclock")
=============================================================================
| | | dlist_foreach(iter, bucket)
6.91 | 7.06 | 5.89 | mov 0x8(%rbp),%rbx
0.78 | 0.73 | 0.81 | test %rbx,%rbx
| | | ↓ je 160
| | | cmp %rbx,%rbp
0.46 | 0.52 | 0.39 | ↓ jne 9d
| | | ↓ jmpq 160
| | | nop
5.68 | 5.54 | 6.03 | 90: mov 0x8(%rbx),%rbx
1.44 | 1.42 | 1.43 | cmp %rbx,%rbp
| | | ↓ je 160
| | | {
| | | ct = dlist_container(CatCTup, cache_elem, iter.cur);
| | |
| | | if (ct->dead)
30.36 |30.97 | 31.48 | 9d: cmpb $0x0,0x20(%rbx)
2.63 | 2.60 | 2.69 | ↑ jne 90
| | | continue; /* ignore dead entries */
| | |
| | | if (ct->hash_value != hashValue)
1.41 | 1.37 | 1.35 | cmp -0x24(%rbx),%edx
3.19 | 2.97 | 2.87 | ↑ jne 90
7.17 | 5.53 | 6.89 | mov %r13,%rsi
0.02 | 0.04 | 0.04 | xor %r12d,%r12d
3.00 | 2.98 | 2.95 | ↓ jmp b5
0.15 | 0.61 | 0.20 | b0: mov 0x10(%rsp,%r12,1),%rsi
6.58 | 5.04 | 5.95 | b5: mov %ecx,0xc(%rsp)
| | | CatalogCacheCompareTuple():
| | | if (!(cc_fastequal[i]) (cachekeys[i], searchkeys[i]))
1.51 | 0.92 | 1.66 | mov -0x20(%rbx,%r12,1),%rdi
0.54 | 1.64 | 0.58 | mov %edx,0x8(%rsp)
3.78 | 3.11 | 3.86 | → callq *0x38(%r14,%r12,1)
0.43 | 2.30 | 0.34 | mov 0x8(%rsp),%edx
0.20 | 0.94 | 0.25 | mov 0xc(%rsp),%ecx
0.44 | 0.41 | 0.44 | test %al,%al
| | | ↑ je 90
| | | for (i = 0; i < nkeys; i++)
2.28 | 1.07 | 2.26 | add $0x8,%r12
0.08 | 0.23 | 0.07 | cmp $0x18,%r12
0.11 | 0.64 | 0.10 | ↑ jne b0
| | | dlist_move_head():
| | | */
| | | static inline void
| | | dlist_move_head(dlist_head *head, dlist_node *node)
| | | {
| | | /* fast path if it's already at the head */
| | | if (head->head.next == node)
0.08 | 0.61 | 0.04 | cmp 0x8(%rbp),%rbx
0.02 | 0.10 | 0.00 | ↓ je 10f
| | | return;
| | |
| | | dlist_delete(node);
0.01 | 0.20 | 0.06 | mov 0x8(%rbx),%rax
| | | dlist_delete():
| | | node->prev->next = node->next;
0.75 | 0.13 | 0.72 | mov (%rbx),%rdx
2.89 | 3.42 | 2.22 | mov %rax,0x8(%rdx)
| | | node->next->prev = node->prev;
0.01 | 0.09 | 0.00 | mov (%rbx),%rdx
0.04 | 0.62 | 0.58 | mov %rdx,(%rax)
| | | dlist_push_head():
| | | if (head->head.next == NULL) /* convert NULL header to circular */
0.31 | 0.08 | 0.28 | mov 0x8(%rbp),%rax
0.55 | 0.44 | 0.28 | test %rax,%rax
| | | ↓ je 180
| | | node->next = head->head.next;
0.00 | 0.08 | 0.06 |101: mov %rax,0x8(%rbx)
| | | node->prev = &head->head;
0.17 | 0.73 | 0.37 | mov %rbp,(%rbx)
| | | node->next->prev = node;
0.34 | 0.08 | 1.13 | mov %rbx,(%rax)
| | | head->head.next = node;
0.47 | 0.27 | 0.17 | mov %rbx,0x8(%rbp)
| | | SearchCatCacheInternal():
| | | ct->naccess++;
| | | ct->lastaccess = catcacheclock;
----- |----- | 0.02 |10f: mov catcacheclock,%rax
| | | ct->naccess++;
----- | 0.96 | 1.00 | addl $0x1,0x14(%rbx)
| | | return NULL;
----- | 0.11 | 0.16 | xor %ebp,%ebp
| | | if (!ct->negative)
0.27 | 0.30 | 0.03 | cmpb $0x0,0x21(%rbx)
| | | ct->lastaccess = catcacheclock;
----- | ---- | 0.28 | mov %rax,0x18(%rbx)
| | | if (!ct->negative)
0.34 | 0.08 | 0.59 | ↓ jne 149
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 09/11/2020 11:34, Kyotaro Horiguchi wrote:
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
Do you need the "ntaccess == 2" test? You could always increment the
counter, and in the code that uses ntaccess to decide what to evict,
treat all values >= 2 the same.
Need to handle integer overflow somehow. Or maybe not: integer
overflow is so infrequent that even if a hot syscache entry gets
evicted prematurely because its ntaccess count wrapped around to 0, it
will happen so rarely that it won't make any difference in practice.
Relaxing that simplifies the code significantly, but a degradation of
about 5% still remains.
(SearchCatCacheInternal())
+ ct->naccess++;
!+ ct->lastaccess = catcacheclock;
If I remove the second line above, the degradation disappears
(-0.7%).
0.7% degradation is probably acceptable.
However, I can't find the corresponding numbers in the output of
perf. The sum of the numbers for the removed instructions is (0.02
+ 0.28 = 0.3%). I know that whole-run degradation doesn't always show
up in the instruction-level profile, but I'm stuck here, anyway.
Hmm. Some kind of cache miss effect, perhaps? offsetof(CatCTup, tuple)
is exactly 64 bytes currently, so any fields that you add after 'tuple'
will go on a different cache line. Maybe it would help if you just
moved the new fields before 'tuple'.
Making CatCTup smaller might help. Some ideas/observations:
- The 'ct_magic' field is only used for assertion checks. Could remove it.
- 4 Datums (32 bytes) are allocated for the keys, even though most
catcaches have fewer key columns.
- In the current syscaches, keys[2] and keys[3] are only used to store
32-bit oids or some other smaller fields. Allocating a full 64-bit Datum
for them wastes memory.
- You could move the dead flag to the end of the struct, or remove it
altogether, with the change I mentioned earlier to not keep dead items
in the buckets
- You could steal a few bits for dead/negative flags from some other
field. Use special values for tuple.t_len for them or something.
With some of these tricks, you could shrink CatCTup so that the new
lastaccess and naccess fields would fit in the same cacheline.
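To make the layout point concrete, here is a hypothetical and heavily
simplified CatCTup with the new fields placed before 'tuple' so that
they stay within the first 64-byte cache line (the real struct in
catcache.h has more members; this is only a sketch, not the patch's
actual code):

typedef struct CatCTup
{
    uint32      hash_value;     /* hash value for this tuple's keys */
    dlist_node  cache_elem;     /* list member of per-bucket list */
    int         refcount;       /* number of active references */
    bool        dead;           /* dead but not yet removed? */
    bool        negative;       /* negative cache entry? */

    /*
     * New fields kept before 'tuple', so that together with the rest
     * of the header they still fit in one cache line.
     */
    unsigned int naccess;       /* accesses since last pruning round */
    TimestampTz lastaccess;     /* timestamp of the latest access */

    Datum       keys[4];        /* the 4 Datums discussed above */
    HeapTupleData tuple;        /* tuple header and data follow */
} CatCTup;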
That said, I think this is good enough performance-wise as it is. So if
we want to improve performance in general, that can be a separate patch.
- Heikki
On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
0.7% degradation is probably acceptable.
I haven't looked at this patch in a while and I'm pleased with the way
it seems to have been redesigned. It seems relatively simple and
unlikely to cause big headaches. I would say that 0.7% is probably not
acceptable on a general workload, but it seems fine on a benchmark
that is specifically designed to be a worst-case for this patch, which
I gather is what's happening here. I think it would be nice if we
could enable this feature by default. Does it cause a measurable
regression on realistic workloads when enabled? I bet a default of 5
or 10 minutes would help many users.
One idea for improving things might be to move the "return
immediately" tests in CatCacheCleanupOldEntries() to the caller, and
only call this function if they indicate that there is some purpose.
This would avoid the function call overhead when nothing can be done.
Perhaps the two tests could be combined into one and simplified. Like,
suppose the code looks (roughly) like this:
if (catcacheclock >= time_at_which_we_can_prune)
CatCacheCleanupOldEntries(...);
To make it that simple, we want catcacheclock and
time_at_which_we_can_prune to be stored as bare uint64 quantities so
we don't need TimestampDifference(). And we want
time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature
is disabled. But those both seem like pretty achievable things... and
it seems like the result would probably be faster than what you have
now.
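Spelled out, that could look like the following sketch
(time_at_which_we_can_prune and last_prune_time are Robert's
hypothetical names, not existing variables):

/*
 * PG_UINT64_MAX is the "feature disabled" sentinel, so the disabled
 * case costs exactly one compare and an untaken branch.
 */
static uint64 time_at_which_we_can_prune = PG_UINT64_MAX;

/* hot path */
if (catcacheclock >= time_at_which_we_can_prune)
    CatCacheCleanupOldEntries(cp);

/* recomputed only when the GUC changes or a pruning pass runs */
time_at_which_we_can_prune = (catalog_cache_prune_min_age < 0)
    ? PG_UINT64_MAX
    : last_prune_time + (uint64) catalog_cache_prune_min_age * USECS_PER_SEC;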
+ * per-statement basis and additionaly udpated periodically
two words spelled wrong
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+ catalog_cache_prune_min_age = newval;
+}
hmm, do we need this?
+ /*
+ * Entries that are not accessed after the last pruning
+ * are removed in that seconds, and their lives are
+ * prolonged according to how many times they are accessed
+ * up to three times of the duration. We don't try shrink
+ * buckets since pruning effectively caps catcache
+ * expansion in the long term.
+ */
+ ct->naccess = Min(2, ct->naccess);
The code doesn't match the comment, it seems, because the limit here
is 2, not 3. I wonder if this does anything anyway. My intuition is
that when a catcache entry gets accessed at all it's probably likely
to get accessed a bunch of times. If there are any meaningful
thresholds here I'd expect us to be trying to distinguish things like
1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't
need to distinguish at all and can just have a single mark bit rather
than a counter.
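If a bare mark bit replaced the counter, the two sides might look like
this purely hypothetical sketch (the 'accessed' field and the helper
are invented for illustration):

/* on every cache hit: a single store, no arithmetic at all */
ct->accessed = true;

/*
 * In the pruning pass: second-chance eviction. Clear the mark and
 * evict only entries untouched since the previous pass.
 */
if (ct->accessed)
    ct->accessed = false;
else
    RemoveCatCacheEntry(cp, ct);    /* invented helper */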
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Tue, 17 Nov 2020 17:46:25 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 09/11/2020 11:34, Kyotaro Horiguchi wrote:
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
Do you need the "ntaccess == 2" test? You could always increment the
counter, and in the code that uses ntaccess to decide what to evict,
treat all values >= 2 the same.
Need to handle integer overflow somehow. Or maybe not: integer
overflow is so infrequent that even if a hot syscache entry gets
evicted prematurely because its ntaccess count wrapped around to 0, it
will happen so rarely that it won't make any difference in practice.
Relaxing that simplifies the code significantly, but a degradation of
about 5% still remains.
(SearchCatCacheInternal())
+ ct->naccess++;
!+ ct->lastaccess = catcacheclock;
If I remove the second line above, the degradation disappears (-0.7%).
0.7% degradation is probably acceptable.
Sorry for the confusion: "-0.7% degradation" meant "+0.7% gain".
However, I can't find the corresponding numbers in the output of
perf. The sum of the numbers for the removed instructions is (0.02
+ 0.28 = 0.3%). I know that whole-run degradation doesn't always show
up in the instruction-level profile, but I'm stuck here, anyway.
Hmm. Some kind of cache miss effect, perhaps?
Shouldn't it be seen in the perf result?
offsetof(CatCTup, tuple) is exactly 64 bytes currently, so any fields
that you add after 'tuple' will go on a different cache line. Maybe it
would help if you just moved the new fields before 'tuple'.
Making CatCTup smaller might help. Some ideas/observations:
- The 'ct_magic' field is only used for assertion checks. Could remove it.
Ok, removed.
- 4 Datums (32 bytes) are allocated for the keys, even though most
catcaches have fewer key columns.
- In the current syscaches, keys[2] and keys[3] are only used to store
32-bit oids or some other smaller fields. Allocating a full 64-bit
Datum for them wastes memory.
It seems to be the last resort.
- You could move the dead flag to the end of the struct, or remove it
altogether, with the change I mentioned earlier to not keep dead items
in the buckets
This seems most promising, so I did it. One annoyance is that we need
to know whether a catcache tuple has been invalidated or not to judge
whether to remove it. In the next version I used
CatCTup.cache_elem.prev to signal that.
- You could steal a few bits for dead/negative flags from some other
field. Use special values for tuple.t_len for them or something.
I stole the MSB of refcount for the negative flag, but the bit-masking
operations seem to make the function slower. Benchmark 2 gets slower
by around +2% in total.
With some of these tricks, you could shrink CatCTup so that the new
lastaccess and naccess fields would fit in the same cacheline.
That said, I think this is good enough performance-wise as it is. So
if we want to improve performance in general, that can be a separate
patch.
Removing CatCTup.dead increased the performance of catcache search
significantly, but catcache entry creation gets slower for reasons I
haven't pinned down yet.
(Continued in a reply to Robert's comment.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for the comments.
At Tue, 17 Nov 2020 16:22:54 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
0.7% degradation is probably acceptable.
I haven't looked at this patch in a while and I'm pleased with the way
it seems to have been redesigned. It seems relatively simple and
unlikely to cause big headaches. I would say that 0.7% is probably not
acceptable on a general workload, but it seems fine on a benchmark
Sorry for the confusing notation: "-0.7% degradation" meant a +0.7%
*gain*, which I think is measurement error. However, the next patch
makes the catcache apparently *faster*, so the difference doesn't
matter.
that is specifically designed to be a worst-case for this patch, which
I gather is what's happening here. I think it would be nice if we
could enable this feature by default. Does it cause a measurable
regression on realistic workloads when enabled? I bet a default of 5
or 10 minutes would help many users.
One idea for improving things might be to move the "return
immediately" tests in CatCacheCleanupOldEntries() to the caller, and
only call this function if they indicate that there is some purpose.
This would avoid the function call overhead when nothing can be done.
Perhaps the two tests could be combined into one and simplified. Like,
suppose the code looks (roughly) like this:
if (catcacheclock >= time_at_which_we_can_prune)
CatCacheCleanupOldEntries(...);
The compiler removes the call (or inlines the function), but of course
we can write it that way, and it shows the condition for calling the
function better. The codelet above neglects the result of
CatCacheCleanupOldEntries() itself: the function returns false when
all "old" entries have been invalidated or explicitly removed, and we
need to expand the hash table in that case.
To make it that simple, we want catcacheclock and
time_at_which_we_can_prune to be stored as bare uint64 quantities so
we don't need TimestampDifference(). And we want
time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature
is disabled. But those both seem like pretty achievable things... and
it seems like the result would probably be faster than what you have
now.
The time_at_which_we_can_prune is not global but catcache-local, and
it needs to change when catalog_cache_prune_min_age is changed. So the
next version does the following:
- if (CatCacheCleanupOldEntries(cp))
+ if (catcacheclock - cp->cc_oldest_ts > prune_min_age_us &&
+ CatCacheCleanupOldEntries(cp))
On the other hand, CatCacheCleanupOldEntries() can calculate the
time_at_which_we_can_prune once at the beginning of the function. That
makes the condition in the loop simpler.
- TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
-
- if (age > catalog_cache_prune_min_age)
+ if (ct->lastaccess < prune_threshold)
{
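Spelled out, the restructured cleanup pass might look like this sketch
(prune_min_age_us being the microsecond form of the GUC that the
assign hook maintains; field names follow the patch):

/* compute the cutoff once per cleanup pass ... */
uint64      prune_threshold = catcacheclock - prune_min_age_us;
dlist_mutable_iter iter;
int         i;

for (i = 0; i < cp->cc_nbuckets; i++)
{
    dlist_foreach_modify(iter, &cp->cc_bucket[i])
    {
        CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

        /* ... so each entry costs a single integer comparison */
        if (ct->lastaccess < prune_threshold)
            CatCacheRemoveCTup(cp, ct);
    }
}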
+ * per-statement basis and additionaly udpated periodically
two words spelled wrong
Ugh. Fixed. I checked all spellings and found another misspelling.
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+ catalog_cache_prune_min_age = newval;
+}
hmm, do we need this?
*That* one is actually useless, but the function is kept, and now it
maintains the internal version of the GUC parameter (uint64
prune_min_age).
+ /*
+ * Entries that are not accessed after the last pruning
+ * are removed in that seconds, and their lives are
+ * prolonged according to how many times they are accessed
+ * up to three times of the duration. We don't try shrink
+ * buckets since pruning effectively caps catcache
+ * expansion in the long term.
+ */
+ ct->naccess = Min(2, ct->naccess);
The code doesn't match the comment, it seems, because the limit here
is 2, not 3. I wonder if this does anything anyway. My intuition is
that when a catcache entry gets accessed at all it's probably likely
to get accessed a bunch of times. If there are any meaningful
thresholds here I'd expect us to be trying to distinguish things like
1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't
need to distinguish at all and can just have a single mark bit rather
than a counter.
Agreed. Since I don't see a clear criterion for the counter's
threshold, I removed naccess and the related lines.
I did the following changes in the attached.
1. Removed naccess and related lines.
2. Moved the precheck condition out of CatCacheCleanupOldEntries() to
RehashCatCache().
3. Use uint64 direct comparison instead of TimestampDifference().
4. Removed CatCTup.dead flag.
Performance measurement on the attached showed better results for
searching, but maybe worse for cache entry creation. Each time figure
is the mean of 10 runs.
# Catcache (negative) entry creation
: time(ms) (% to master)
master : 3965.61 (100.0)
patched-off: 4040.93 (101.9)
patched-on : 4032.22 (101.7)
# Searching negative cache entries
master : 8173.46 (100.0)
patched-off: 7983.43 ( 97.7)
patched-on : 8049.88 ( 98.5)
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
That is, catcache searching gets faster by 2-3% but creation gets
slower by about 2%. If I move the condition of item 2 further up, into
CatalogCacheCreateEntry(), the degradation reduces to 0.6%.
# Catcache (negative) entry creation
master : 3967.45 (100.0)
patched-off : 3990.43 (100.6)
patched-on : 4108.96 (103.6)
# Searching negative cache entries
master : 8106.53 (100.0)
patched-off : 8036.61 ( 99.1)
patched-on : 8058.18 ( 99.4)
# Creation, searching and expiration
master : 6395.00 (100.0)
patched-off : 6416.57 (100.3)
patched-on : 15830.91 (247.6)
It doesn't get smaller even if I revert the changed lines in
CatalogCacheCreateEntry().
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
What's the deal with this massive increase here?
Greetings,
Andres Freund
At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
What's the deal with this massive increase here?
CatCacheRemoveCTup(). If I replace the call to that function in the
cleanup function with dlist_delete(), the result changes as follows:
master : 6372.04 (100.0) (2)
patched-off : 6464.97 (101.5) (2)
patched-on : 5354.42 ( 84.0) (2)
We could speed up expiration if we reused the "deleted" entry at the
next entry creation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 19 Nov 2020 15:23:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
What's the deal with this massive increase here?
CatCacheRemoveCTup(). If I replace the call to that function in the
cleanup function with dlist_delete(), the result changes as follows:
master : 6372.04 (100.0) (2)
patched-off : 6464.97 (101.5) (2)
patched-on : 5354.42 ( 84.0) (2)
We could speed up expiration if we reused the "deleted" entry at the
next entry creation.
That result must be bogus; it forgot to update cc_ntup.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Ah. It was obvious from the start.
Sorry for the sloppy diagnosis.
At Fri, 20 Nov 2020 16:08:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 19 Nov 2020 15:23:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
What's the deal with this massive increase here?
catalog_cache_prune_min_age was set to 0 at the time, so almost all
catcache entries are dropped at rehashing time. Most of the difference
should be the time spent searching the system catalog again.
2020-11-20 16:25:25.988 LOG: database system is ready to accept connections
2020-11-20 16:26:48.504 LOG: Catcache reset
2020-11-20 16:26:48.504 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 257: 0.001500 ms
2020-11-20 16:26:48.504 LOG: rehashed catalog cache id 58 for pg_statistic; 257 tups, 256 buckets, 0.020748 ms
2020-11-20 16:26:48.505 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 513: 0.003221 ms
2020-11-20 16:26:48.505 LOG: rehashed catalog cache id 58 for pg_statistic; 513 tups, 512 buckets, 0.006962 ms
2020-11-20 16:26:48.505 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 1025: 0.006744 ms
2020-11-20 16:26:48.505 LOG: rehashed catalog cache id 58 for pg_statistic; 1025 tups, 1024 buckets, 0.009580 ms
2020-11-20 16:26:48.507 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 2049: 0.015683 ms
2020-11-20 16:26:48.507 LOG: rehashed catalog cache id 58 for pg_statistic; 2049 tups, 2048 buckets, 0.041008 ms
2020-11-20 16:26:48.509 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 4097: 0.042438 ms
2020-11-20 16:26:48.509 LOG: rehashed catalog cache id 58 for pg_statistic; 4097 tups, 4096 buckets, 0.077379 ms
2020-11-20 16:26:48.515 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 8193: 0.123798 ms
2020-11-20 16:26:48.515 LOG: rehashed catalog cache id 58 for pg_statistic; 8193 tups, 8192 buckets, 0.198505 ms
2020-11-20 16:26:48.525 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 16385: 0.180831 ms
2020-11-20 16:26:48.526 LOG: rehashed catalog cache id 58 for pg_statistic; 16385 tups, 16384 buckets, 0.361109 ms
2020-11-20 16:26:48.546 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 32769: 0.717899 ms
2020-11-20 16:26:48.547 LOG: rehashed catalog cache id 58 for pg_statistic; 32769 tups, 32768 buckets, 1.443587 ms
2020-11-20 16:26:48.588 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 65537: 1.204804 ms
2020-11-20 16:26:48.591 LOG: rehashed catalog cache id 58 for pg_statistic; 65537 tups, 65536 buckets, 3.069916 ms
2020-11-20 16:26:48.674 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 131073: 2.707709 ms
2020-11-20 16:26:48.681 LOG: rehashed catalog cache id 58 for pg_statistic; 131073 tups, 131072 buckets, 7.127622 ms
2020-11-20 16:26:48.848 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 262145: 5.895630 ms
2020-11-20 16:26:48.862 LOG: rehashed catalog cache id 58 for pg_statistic; 262145 tups, 262144 buckets, 13.433610 ms
2020-11-20 16:26:49.195 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 524289: 12.302632 ms
2020-11-20 16:26:49.223 LOG: rehashed catalog cache id 58 for pg_statistic; 524289 tups, 524288 buckets, 27.710900 ms
2020-11-20 16:26:49.937 LOG: pruning catalog cache id=58 for pg_statistic: removed 1001000 / 1048577: 66.062629 ms
2020-11-20 16:26:51.195 LOG: pruning catalog cache id=58 for pg_statistic: removed 1002001 / 1048577: 65.533468 ms
2020-11-20 16:26:52.413 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 1048577: 25.623740 ms
2020-11-20 16:26:52.468 LOG: rehashed catalog cache id 58 for pg_statistic; 1048577 tups, 1048576 buckets, 54.314825 ms
2020-11-20 16:26:53.898 LOG: pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 134.530582 ms
2020-11-20 16:26:56.404 LOG: pruning catalog cache id=58 for pg_statistic: removed 1002001 / 2097153: 111.634597 ms
2020-11-20 16:26:57.779 LOG: pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 134.628430 ms
2020-11-20 16:27:00.389 LOG: pruning catalog cache id=58 for pg_statistic: removed 1002001 / 2097153: 147.221688 ms
2020-11-20 16:27:01.851 LOG: pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 177.610820 ms
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
Performance measurement on the attached showed better results for
searching, but maybe worse for cache entry creation. Each time figure
is the mean of 10 runs.
# Catcache (negative) entry creation
: time(ms) (% to master)
master : 3965.61 (100.0)
patched-off: 4040.93 (101.9)
patched-on : 4032.22 (101.7)
# Searching negative cache entries
master : 8173.46 (100.0)
patched-off: 7983.43 ( 97.7)
patched-on : 8049.88 ( 98.5)
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
That is, catcache searching gets faster by 2-3% but creation gets
slower by about 2%. If I move the condition of item 2 further up, into
CatalogCacheCreateEntry(), the degradation reduces to 0.6%.
# Catcache (negative) entry creation
master : 3967.45 (100.0)
patched-off : 3990.43 (100.6)
patched-on : 4108.96 (103.6)
# Searching negative cache entries
master : 8106.53 (100.0)
patched-off : 8036.61 ( 99.1)
patched-on : 8058.18 ( 99.4)
# Creation, searching and expiration
master : 6395.00 (100.0)
patched-off : 6416.57 (100.3)
patched-on : 15830.91 (247.6)
Can you share the exact script or steps to reproduce these numbers? I
presume these are from the catcachebench extension, but I can't figure
out which scenario above corresponds to which catcachebench test. Also,
catcachebench seems to depend on a bunch of tables being created in a
schema called "test"; what tables did you use for the above numbers?
- Heikki
At Tue, 26 Jan 2021 11:43:21 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
Hi,
On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
Performance measurement on the attached showed better results for
searching, but maybe worse for cache entry creation. Each time figure
is the mean of 10 runs.
# Catcache (negative) entry creation
: time(ms) (% to master)
master : 3965.61 (100.0)
patched-off: 4040.93 (101.9)
patched-on : 4032.22 (101.7)
# Searching negative cache entries
master : 8173.46 (100.0)
patched-off: 7983.43 ( 97.7)
patched-on : 8049.88 ( 98.5)
# Creation, searching and expiration
master : 6393.23 (100.0)
patched-off: 6527.94 (102.1)
patched-on : 15880.01 (248.4)
That is, catcache searching gets faster by 2-3% but creation gets
slower by about 2%. If I move the condition of item 2 further up, into
CatalogCacheCreateEntry(), the degradation reduces to 0.6%.
# Catcache (negative) entry creation
master : 3967.45 (100.0)
patched-off : 3990.43 (100.6)
patched-on : 4108.96 (103.6)
# Searching negative cache entries
master : 8106.53 (100.0)
patched-off : 8036.61 ( 99.1)
patched-on : 8058.18 ( 99.4)
# Creation, searching and expiration
master : 6395.00 (100.0)
patched-off : 6416.57 (100.3)
patched-on : 15830.91 (247.6)
Can you share the exact script or steps to reproduce these numbers? I
presume these are from the catcachebench extension, but I can't figure
out which scenario above corresponds to which catcachebench
test. Also, catcachebench seems to depend on a bunch of tables being
created in schema called "test"; what tables did you use for the above
numbers?
Use gen_tbl.pl to generate the tables, then run2.sh to run the
benchmark, and sumlog.pl to summarize the result of run2.sh.
$ ./gen_tbl.pl | psql postgres
$ ./run2.sh | tee rawresult.txt | ./sumlog.pl
(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2); I'll repost an updated version soon.)
A simple explanation follows, since the scripts are kind of crappy.
run2.sh:
LOOPS : # of execution of catcachebench() in a single run
USES : Take the average of this number of fastest executions in a
single run.
BINROOT : Common parent directory of target binaries.
DATADIR : Data directory. (shared by all binaries)
PREC : FP format for time and percentage in a result.
TESTS : comma-separated numbers given to catcachebench.
The "run" function spec
run "binary-label" <binary-path> <A> <B> <C>
where A, B and C are the value for catalog_cache_prune_min_age. ""
means no setting (used for master binary). Currently only C is in
effect but all the three should be non-empty string to make it
effective.
The result output is:
test | version | n | r | stddev
------+-------------+-----+----------+---------
1 | patched-off | 1/3 | 14211.96 | 261.19
test : # of catcachebench(#)
version: binary label given to the run function
n : USES / LOOPS
r : result average time of catcachebench() in milliseconds
stddev : stddev of the measured times
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.
At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2); I'll repost an updated version soon.)
I noticed that a catcachebench-aid function,
CatalogCacheFlushCatalog2(), allocates its bucket array wrongly in the
current memory context, which leads to a crash.
This is a fixed and rebased version.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 27/01/2021 03:13, Kyotaro Horiguchi wrote:
At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.
At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2); I'll repost an updated version soon.)
I noticed that a catcachebench-aid function,
CatalogCacheFlushCatalog2(), allocates its bucket array wrongly in the
current memory context, which leads to a crash. This is a fixed and
rebased version.
Thanks, with the scripts you provided I was able to run the
performance tests on my laptop, and got very similar results to yours.
The impact of v7-0002-Remove-dead-flag-from-catcache-tuple.patch is
very small. I think I could see it in the tests, but only barely. And
the tests did nothing other than syscache lookups; in any real-world
scenario, it would be lost in the noise. I think we can put that aside
for now, and focus on v6-0001-CatCache-expiration-feature.patch:
The pruning is still pretty lethargic:
- Entries created in the same transaction are never pruned away
- The size of the hash table is never shrunk. So even though the patch
puts a backstop to the hash table growing indefinitely, if you run one
transaction that bloats the cache, it's bloated for the rest of the session.
I think that's OK. We might want to be more aggressive in the future,
but for now it seems reasonable to lean towards the current behavior
where nothing is pruned. Although I wonder if we should try to set
'catcacheclock' more aggressively. I think we could set it whenever
GetCurrentTimestamp() is called, for example.
Given how unaggressive this mechanism is, I think it should be safe to
enable it by default. What would be a suitable default for
catalog_cache_prune_min_age? 30 seconds?
Documentation needs to be updated for the new GUC.
Attached is a version with a few little cleanups:
- use TimestampTz instead of uint64 for the timestamps
- remove assign_catalog_cache_prune_min_age(). All it did was convert
the GUC's value from seconds to microseconds, and stored it in a
separate variable. Multiplication is cheap, so we can just do it when we
use the GUC's value instead.
- Heikki
Attachments:
At Wed, 27 Jan 2021 13:11:55 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 27/01/2021 03:13, Kyotaro Horiguchi wrote:
At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.
At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2); I'll repost an updated version soon.)
I noticed that a catcachebench-aid function,
CatalogCacheFlushCatalog2(), allocates its bucket array wrongly in the
current memory context, which leads to a crash. This is a fixed and
rebased version.
Thanks, with the scripts you provided I was able to run the
performance tests on my laptop, and got very similar results to yours.
The impact of v7-0002-Remove-dead-flag-from-catcache-tuple.patch is
very small. I think I could see it in the tests, but only barely. And
the tests did nothing other than syscache lookups; in any real-world
scenario, it would be lost in the noise. I think we can put that aside
for now, and focus on v6-0001-CatCache-expiration-feature.patch:
I agree with that opinion. But it's a bit disappointing that the long
struggle ended up in vain :p
The pruning is still pretty lethargic:
- Entries created in the same transaction are never pruned away
- The size of the hash table is never shrunk. So even though the patch
puts a backstop to the hash table growing indefinitely, if you run one
transaction that bloats the cache, it's bloated for the rest of the
session.
Right. But more frequent checks impact performance. We could do more
aggressive pruning at idle time.
I think that's OK. We might want to be more aggressive in the future,
but for now it seems reasonable to lean towards the current behavior
where nothing is pruned. Although I wonder if we should try to set
'catcacheclock' more aggressively. I think we could set it whenever
GetCurrentTimestamp() is called, for example.
Ah. I hadn't thought in that direction. global_last_acquired_timestamp
or some such?
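As a sketch of that direction, the body below mirrors the real
GetCurrentTimestamp() in src/backend/utils/adt/timestamp.c; the single
added assignment (to the patch's catcacheclock) is the whole idea:

TimestampTz
GetCurrentTimestamp(void)
{
    TimestampTz result;
    struct timeval tp;

    gettimeofday(&tp, NULL);

    result = (TimestampTz) tp.tv_sec -
        ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
    result = (result * USECS_PER_SEC) + tp.tv_usec;

    /* the added line: keep the catcache clock fresh for free */
    catcacheclock = result;

    return result;
}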
Given how unaggressive this mechanism is, I think it should be safe to
enable it by default. What would be a suitable default for
catalog_cache_prune_min_age? 30 seconds?
Without thinking it through in detail, that seems a bit too short. The
values suggested so far for this variable are 300-600s. That way,
intermittent queries issued at about 5-10 minute intervals don't lose
cache entries.
In a bad case, two queries alternately remove each other's cache
entries.
Q1: adds 100 entries
<1 minute passed>
Q2: adds 100 entries but rehash is going to happen at 150 entries and
the existing 100 entries added by Q1 are removed.
<1 minute passed>
Q1: adds 100 entries but rehash is going to happen at 150 entries and
the existing 100 entries added by Q2 are removed.
<repeats>
Or a transaction sequence that persists longer than that many seconds
may lose some of its catcache entries.
Documentation needs to be updated for the new GUC.
Attached is a version with a few little cleanups:
- use TimestampTz instead of uint64 for the timestamps
- remove assign_catalog_cache_prune_min_age(). All it did was convert
the GUC's value from seconds to microseconds and store it in a
separate variable. Multiplication is cheap, so we can just do it when
we use the GUC's value instead.
Yeah, the latter is a trace of the struggle to cut down CPU cycles on
the normal paths. I don't object to doing so.
I found that some comments are apparently stale, and cp->cc_oldest_ts
was not used anywhere even though it was added for deciding whether to
scan or not.
I fixed the following points in the attached.
- Removed some comments that are obvious. ("Timestamp in us")
- Added a cp->cc_oldest_ts check in CatCacheCleanupOldEntries.
- Set the default value for catalog_cache_prune_min_age to 600s.
- Added a doc entry for the new GUC in the resource/memory section.
- Fixed some code comments.
- Adjusted the pruning criterion from (ct->lastaccess < prune_threshold) to <=.
I was going to write in the doc something like "you can inspect memory
consumption by catalog caches using pg_backend_memory_contexts", but
all the memory used by the catalog caches is in CacheMemoryContext. Is
it sensible for each catalog cache to have its own context?
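A sketch of what that could look like (the cc_mcxt field and the
context name are inventions for illustration; the attached patch may
differ):

/*
 * In InitCatCache(): one child context per catcache, so that
 * pg_get_backend_memory_contexts() can attribute memory per cache.
 */
cp->cc_mcxt = AllocSetContextCreate(CacheMemoryContext,
                                    "catcache",
                                    ALLOCSET_DEFAULT_SIZES);

/*
 * In CatalogCacheCreateEntry(): allocate entries in that context
 * instead of directly in CacheMemoryContext (alloc_size stands for
 * whatever size the existing code computes).
 */
ct = (CatCTup *) MemoryContextAlloc(cp->cc_mcxt, alloc_size);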
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Thu, 28 Jan 2021 16:50:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I was going to write in the doc something like "you can inspect memory
consumption by catalog caches using pg_backend_memory_contexts", but
all the memory used by the catalog caches is in CacheMemoryContext. Is
it sensible for each catalog cache to have its own context?
Something like this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Thu, Jan 28, 2021 at 05:16:52PM +0900, Kyotaro Horiguchi wrote:
At Thu, 28 Jan 2021 16:50:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I was going to write in the doc something like "you can inspect memory
consumption by catalog caches using pg_backend_memory_contexts", but
all the memory used by the catalog caches is in CacheMemoryContext. Is
it sensible for each catalog cache to have its own context?
Something like this.
Is this feature not going to make it into PG 14? It first appeared in
the January, 2017 commitfest:
https://commitfest.postgresql.org/32/931/
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
At Mon, 22 Mar 2021 13:12:10 -0400, Bruce Momjian <bruce@momjian.us> wrote in
On Thu, Jan 28, 2021 at 05:16:52PM +0900, Kyotaro Horiguchi wrote:
At Thu, 28 Jan 2021 16:50:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I was going to write in the doc something like "you can inspect memory
consumption by catalog caches using pg_backend_memory_contexts", but
all the memory used by the catalog caches is in CacheMemoryContext. Is
it sensible for each catalog cache to have its own context?
Something like this.
Is this feature not going to make it into PG 14? It first appeared in
the January, 2017 commitfest:
Thank you for looking at this. However, I'm afraid that you are
looking at a patch which is not part of the CF project "Protect
syscache <blah>". I'd be happy if it were committed.
It intends not only to show more meaningful information through
pg_get_backend_memory_contexts(), but also to make it easy to
investigate what kind of cache is bloating, or the like. With the
patch, the function shows individual context information lines for
catcaches.
postgres=# select pg_get_backend_memory_contexts();
...
(catcache,"catcache id 78",CacheMemoryContext,2,8192,1,6152,0,2040)
(catcache,"catcache id 77",CacheMemoryContext,2,8192,1,6152,0,2040)
(catcache,"catcache id 76",CacheMemoryContext,2,16384,2,7592,3,8792)
Applying catcachecxt_by_name.patch.txt on top of it changes the output
as follows. The names are not familiar to users, but they give far
clearer information.
(catcache,USERMAPPINGUSERSERVER,CacheMemoryContext,2,8192,1,6192,0,2000)
(catcache,USERMAPPINGOID,CacheMemoryContext,2,8192,1,6192,0,2000)
(catcache,TYPEOID,CacheMemoryContext,2,16384,2,7632,0,8752)
Applying catcachecxt_by_name_id.patch.txt on top of the _by_name
patch, the output further changes as follows.
(catcache,USERMAPPINGUSERSERVER[78],CacheMemoryContext,2,8192,1,6136,0,2056)
(catcache,USERMAPPINGOID[77],CacheMemoryContext,2,8192,1,6136,0,2056)
(catcache,TYPEOID[76],CacheMemoryContext,2,16384,2,7592,3,8792)
The number enclosed in brackets is the cache id. It is useless for
users but convenient for debugging :p
catcache_individual_mcxt_2.patch.txt: rebased version of per-catcache context.
catcachecxt_by_name.patch.txt: gives a meaningful name to catcache contexts.
catcachecxt_by_name_id.patch.txt: and adds cache id.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center