Logical decoding CPU-bound w/ large number of tables

Started by Mathieu Fenniak · almost 9 years ago · 11 messages · pgsql-general
#1Mathieu Fenniak
mathieu.fenniak@replicon.com

Heyo,

I'm attempting to use logical decoding with the streaming replication
protocol to perform change-data-capture on PostgreSQL 9.5.4. I'm seeing
the replication stream "stall" for long periods of time where the walsender
process will be pinned at 100% CPU utilization, but no data is being sent
to my client.

The stalls occur unpredictably on my production system, but generally seem
to be correlated with schema operations. My source database has about
100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I've reproduced the same symptoms with two different approaches on my local
machine. With both, I have a replication client connected via the
streaming protocol.

In reproduction approach 1, I've created a thread that inserts small sets
of data, and a thread that creates a schema w/ 500 tables and then drops
it. This approach has pinned CPU usage, but data does come out of it --
just excruciatingly slow when compared to the same test without the schema
create & drop.

In reproduction approach 2, I've created a database w/ 100,000 tables on it
and performed a "vacuum full". The walsender goes to 100% CPU and no data
comes out of the replication stream for hours.

I've performed a CPU sampling with the OSX `sample` tool based upon
reproduction approach #1:
https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8 It
appears that most of the time is spent in the
RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
invalidation callbacks, both of which appear to be invalidating caches
based upon the cache value.

Has anyone else run into this kind of performance problem? Any thoughts on
how it might be resolved? I don't mind putting in the work if someone
could describe what is happening here, and have a discussion with me about
what kind of changes might be necessary to improve the performance.

Thanks all,

*Mathieu Fenniak* | Senior Software Architect | Phone 1-587-315-1185

*Replicon* | The leader in cloud time tracking applications - 7,800+
Customers - 70+ Countries - 1.5 Million Users
www.replicon.com | facebook <http://www.facebook.com/Replicon.inc> |
linkedin <http://www.linkedin.com/company/27961?trk=tyah> | twitter
<http://twitter.com/Replicon> | blog <http://www.replicon.com/blog/> | contact
us <http://www.replicon.com/about_replicon/contact_us.aspx>

*We are hiring!* | search jobs <http://www.replicon.com/careers>

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mathieu Fenniak (#1)
Re: Logical decoding CPU-bound w/ large number of tables

Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:

I'm attempting to use logical decoding with the streaming replication
protocol to perform change-data-capture on PostgreSQL 9.5.4. I'm seeing
the replication stream "stall" for long periods of time where the walsender
process will be pinned at 100% CPU utilization, but no data is being sent
to my client.

I've performed a CPU sampling with the OSX `sample` tool based upon
reproduction approach #1:
https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8
It appears that most of the time is spent in the
RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
invalidation callbacks, both of which appear to be invalidating caches
based upon the cache value.

Hmm ... as for RelfilenodeMapInvalidateCallback, the lack of calls to
hash_search() from it in your trace says that it usually isn't doing
anything useful. All the time is being spent in hash_seq_search,
uselessly iterating over the hashtable. I'm inclined to think that
we need a smarter data structure there, maybe an independent hashtable
tracking the reverse map from relation OID to filenode map entry.
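The reverse-map idea Tom describes can be sketched in miniature. This is a hypothetical Python model (PostgreSQL's actual relfilenode map is a dynahash table in C, and the entry layout is simplified here): invalidation arrives as a relation OID, so keeping a second index keyed by OID avoids iterating every entry the way hash_seq_search does.

```python
# Toy model of the relfilenode map: forward lookups are keyed by
# (tablespace, filenode), but invalidations arrive as a relation OID.
class RelfilenodeMap:
    def __init__(self):
        self.by_filenode = {}  # (tablespace, filenode) -> relation OID
        self.by_reloid = {}    # reverse index: relation OID -> key

    def insert(self, reloid, tablespace, filenode):
        key = (tablespace, filenode)
        self.by_filenode[key] = reloid
        self.by_reloid[reloid] = key

    def invalidate(self, reloid):
        # O(1) via the reverse index, instead of a sequential scan
        # over the whole hashtable looking for matching entries.
        key = self.by_reloid.pop(reloid, None)
        if key is not None:
            del self.by_filenode[key]

m = RelfilenodeMap()
m.insert(16384, 1663, 16384)
m.insert(16400, 1663, 16401)
m.invalidate(16384)
print(sorted(m.by_filenode.values()))  # [16400]
```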

As for CatalogCacheIdInvalidate, I wonder how many of those cycles
are doing something useful, and how many are being wasted in the outer
loop that just iterates over the cache list. We could trivially get
rid of that outer search by using syscache.c's array, as in the
attached patch. It'd be interesting to see if this patch helps your
scenario #1. (Patch is against HEAD but seems to apply cleanly to 9.5)
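The shape of that patch can be illustrated with a toy model (assumed for illustration; the real code indexes into syscache.c's C array of catcaches by cache id rather than using a dict): an invalidation message already carries the cache id, so there is no need for an outer loop over every cache to find the right one.

```python
# Hypothetical sketch of the two dispatch strategies for a catcache
# invalidation message that carries a cache id.
caches = {cache_id: f"catcache-{cache_id}" for cache_id in range(80)}

def invalidate_linear(target_id):
    steps = 0
    for cache_id in caches:      # outer loop over the whole cache list
        steps += 1
        if cache_id == target_id:
            break
    return steps

def invalidate_indexed(target_id):
    _ = caches[target_id]        # direct index into the cache array
    return 1

print(invalidate_linear(79), invalidate_indexed(79))  # 80 1
```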

Most likely, your scenario #2 is completely stuck on the
RelfilenodeMapInvalidateCallback issue, though it would be good
to get a trace to confirm that.

regards, tom lane

Attachments:

avoid-search-in-catcache-invalidate.patch (text/x-diff; +113 −96)
#3Andres Freund
andres@anarazel.de
In reply to: Mathieu Fenniak (#1)
Re: Logical decoding CPU-bound w/ large number of tables

Hi,

On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:

The stalls occur unpredictably on my production system, but generally seem
to be correlated with schema operations. My source database has about
100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I'm unfortunately not entirely surprised you're seeing some issues in
that case. We're invalidating internal caches a bit
overjudiciously, and that invalidation is triggered by schema changes.

I've performed a CPU sampling with the OSX `sample` tool based upon
reproduction approach #1:
https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8 It
appears that most of the time is spent in the
RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
invalidation callbacks, both of which appear to be invalidating caches
based upon the cache value.

I think optimizing those has some value (and I see Tom is looking at
that aspect), but the bigger thing would probably be to do fewer lookups.

Has anyone else run into this kind of performance problem? Any thoughts on
how it might be resolved? I don't mind putting in the work if someone
could describe what is happening here, and have a discussion with me about
what kind of changes might be necessary to improve the performance.

If you could provide an easily runnable sql script that reproduces the
issue, I'll have a look. I think I have a rough idea what to do.

Greetings,

Andres Freund

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: Logical decoding CPU-bound w/ large number of tables

Hi,

On 2017-05-05 20:59:09 -0400, Tom Lane wrote:

Hmm ... as for RelfilenodeMapInvalidateCallback, the lack of calls to
hash_search() from it in your trace says that it usually isn't doing
anything useful. All the time is being spent in hash_seq_search,
uselessly iterating over the hashtable. I'm inclined to think that
we need a smarter data structure there, maybe an independent hashtable
tracking the reverse map from relation OID to filenode map entry.

Yea, that might be worthwhile. Let me try to address the issue that we
do way too much invalidation, then we can check whether this is still
exercised hotly. On the other hand, it's still a dumb invalidation
approach, so if somebody feels like working on this...

As for CatalogCacheIdInvalidate, I wonder how many of those cycles
are doing something useful, and how many are being wasted in the outer
loop that just iterates over the cache list. We could trivially get
rid of that outer search by using syscache.c's array, as in the
attached patch. It'd be interesting to see if this patch helps your
scenario #1. (Patch is against HEAD but seems to apply cleanly to 9.5)

I've seen this be a significant fraction of CPU time completely
independent of logical decoding, so I'd guess this is worthwhile
independently. Not sure what a good benchmark for this would be.

Greetings,

Andres Freund


#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: Logical decoding CPU-bound w/ large number of tables

Andres Freund <andres@anarazel.de> writes:

On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:

It appears that most of the time is spent in the
RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
invalidation callbacks, both of which appear to be invalidating caches
based upon the cache value.

I think optimizing those has some value (and I see Tom is looking at
that aspect), but the bigger thing would probably be to do fewer lookups.

I'm confused --- the lookup side of things is down in the noise in
Mathieu's trace. Further reducing the number of lookups doesn't seem
like it helps this scenario at all. It might matter once we've whacked
down the cost of invalidations ...

regards, tom lane


#6Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Logical decoding CPU-bound w/ large number of tables

On 2017-05-05 21:32:27 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:

It appears that most of the time is spent in the
RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
invalidation callbacks, both of which appear to be invalidating caches
based upon the cache value.

I think optimizing those has some value (and I see Tom is looking at
that aspect), but the bigger thing would probably be to do fewer lookups.

I'm confused --- the lookup side of things is down in the noise in
Mathieu's trace.

Err, sorry. Completely mangled that sentence. Executing fewer
invalidations. We currently are likely re-executing the same set of
invalidations constantly in Mathieu's case.

Background: When decoding a transaction during logical decoding we're
currently re-executing *all* a transaction's own cache invalidations, if
it has any, at every new command-id observed in the WAL stream. That's
because currently invalidations are only sent at commit, so we don't
know from "when" they are. But I think there's some very low-hanging
fruit for reducing the frequency at which those are executed.
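The cost model Andres describes is multiplicative: the transaction's whole invalidation set is replayed at each new command id. A small illustrative simulation (an assumed model for intuition, not PostgreSQL code):

```python
# Illustrative cost model: during decoding, the transaction's full set
# of cache invalidations is re-executed at every new command id seen
# in the WAL stream, so total work grows multiplicatively.
def invalidations_executed(num_command_ids, num_invalidations):
    executed = 0
    for _ in range(num_command_ids):
        executed += num_invalidations  # replay the whole set each time
    return executed

# A transaction dropping several schemas generates thousands of
# invalidation messages across many command ids:
print(invalidations_executed(10, 1000))    # 10000
print(invalidations_executed(1000, 1000))  # 1000000
```

This is why mixing heavy DML with heavy DDL in one transaction is so much worse than either alone: both factors grow at once.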

In many cases where there's just a few schema changes in a transaction,
this doesn't hurt badly. But if you have a transaction that does a
boatload of schema changes *and* has a lot of other changes, it gets
expensive.

Mathieu: The above also indicates a possible workaround, you can try
separating larger amounts of data manipulations from schema changes,
into separate transactions.

Greetings,

Andres Freund


#7Mathieu Fenniak
mathieu.fenniak@replicon.com
In reply to: Andres Freund (#6)
Re: Logical decoding CPU-bound w/ large number of tables

Hi Andres, Tom,

Andres, it seems like the problem is independent of having large data
manipulations mixed with schema changes. The below test case demonstrates
it with just schema changes.

Tom, I've tested your patch, and it seems to have a positive impact for
sure. I've documented a test case to reproduce this issue (below), and
your avoid-search-in-catcache-invalidate.patch reduces the test case time
from 63 seconds per run to 27 seconds per run. Here's an updated sample
capture with the patch applied:
https://gist.github.com/mfenniak/cef36699deb426efd8e433027553a37e

Here's the test case I've been using for the above measurements:

a) creating 10,000 tables in 200 schemas

DO $$
DECLARE
    i INTEGER;
    j INTEGER;
    schema TEXT;
BEGIN
    FOR i IN 1..200 LOOP
        schema := 'test_' || i;
        EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema);
        FOR j IN 1..500 LOOP
            EXECUTE format('CREATE TABLE IF NOT EXISTS %I.%I (f1 int, f2 int)', schema, 'test_' || j);
        END LOOP;
    END LOOP;
END;
$$;

b) start logical replication w/ pg_recvlogical, printing out timestamps of
messages and timestamps of transactions:

pg_recvlogical --create-slot --start -f - --if-not-exists \
    --plugin=test_decoding --slot=cputest --dbname=mathieu.fenniak \
    --username=postgres --option=include-timestamp \
    | gawk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }'

c) drop a few schemas in a transaction

begin;
drop schema test_1 cascade;
drop schema test_2 cascade;
drop schema test_3 cascade;
drop schema test_4 cascade;
drop schema test_5 cascade;
drop schema test_6 cascade;
commit;

Six schema drops had a 63-second delay between the actual commit and
pg_recvlogical outputting it w/ stock 9.5.4:
2017-05-10 08:35:12 BEGIN 956
2017-05-10 08:36:15 COMMIT 956 (at 2017-05-10 08:35:12.106773-06)

Six schema drops w/ Tom's patch (applied to 9.5.4) had a 27 second delay:
2017-05-10 08:48:08 BEGIN 942
2017-05-10 08:48:35 COMMIT 942 (at 2017-05-10 08:48:08.182015-06)

(other test details: wal_level=logical; max_wal_senders=5;
wal_sender_timeout = 3600s; max_locks_per_transaction = 65536;
max_replication_slots = 5)


#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mathieu Fenniak (#7)
Re: Logical decoding CPU-bound w/ large number of tables

Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:

Andres, it seems like the problem is independent of having large data
manipulations mixed with schema changes. The below test case demonstrates
it with just schema changes.

Tom, I've tested your patch, and it seems to have a positive impact for
sure. I've documented a test case to reproduce this issue (below), and
your avoid-search-in-catcache-invalidate.patch reduces the test case time
from 63 seconds per run to 27 seconds per run.

Thanks for the test case. Your machine is apparently a bit faster than
mine, or else Assert overhead is really hurting in this case, because
on a debug-enabled build of HEAD I see the replication time as about
115 seconds, with the principal culprits according to "perf" being

+   53.11%    52.96%        253596  postmaster       postgres                                      [.] CatalogCacheIdInvalidate
+   31.15%    31.06%        148779  postmaster       postgres                                      [.] hash_seq_search
+    5.70%     5.69%         27223  postmaster       postgres                                      [.] CallSyscacheCallbacks
+    3.36%     3.35%         16057  postmaster       postgres                                      [.] LocalExecuteInvalidationMessage
+    1.32%     1.32%          6306  postmaster       postgres                                      [.] ReorderBufferCommit

I concur that the previously-posted patch to avoid searching in
CatalogCacheIdInvalidate makes for about a 2X speedup; for me,
that brings the test case down to 55 seconds, and now the top
hits are

+   59.28%    59.05%        150349  postmaster       postgres                                      [.] hash_seq_search
+    8.40%     8.36%         21250  postmaster       postgres                                      [.] CallSyscacheCallbacks
+    6.37%     6.34%         16115  postmaster       postgres                                      [.] LocalExecuteInvalidationMessage
+    5.69%     5.67%         14418  postmaster       postgres                                      [.] CatCacheInvalidate
+    3.14%     3.12%          7942  postmaster       postgres                                      [.] SysCacheInvalidate
+    1.72%     1.71%          4354  postmaster       postgres                                      [.] ReorderBufferCommit
+    1.37%     1.33%          3512  postmaster       postgres                                      [.] hash_search_with_hash_value
+    1.15%     1.15%          2913  postmaster       postgres                                      [.] InvalidateCatalogSnapshot

I looked at the hash_seq_search time a bit more, and realized that
actually there's a pretty easy fix for that, which is to reduce
the initial size of RelfilenodeMapHash from 1024 entries to 64.
You could quibble about where to set that exactly, but 1024 is
just way too many --- in your example, there are never more than
5 entries in the hash, despite the presence of 10000 tables in
the database. I also checked it while running the regression tests,
and while a few transactions get up to several hundred entries,
they're mostly less than 50. So rather than build a second index
structure for that hashtable, I propose we just do what's in the first
attached patch.
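The reason a smaller initial table helps is that a sequential scan of a hash table touches every bucket, whether occupied or not. A toy model (an assumption for illustration: a fixed-size table scanned bucket by bucket, collisions ignored) makes the effect visible:

```python
# Toy model of scanning a mostly-empty hash table, the way
# hash_seq_search walks every bucket of RelfilenodeMapHash.
def scan(nbuckets, live_entries):
    buckets = [None] * nbuckets
    for e in live_entries:
        buckets[hash(e) % nbuckets] = e  # collisions ignored in this toy
    visited, found = 0, 0
    for b in buckets:                    # every bucket is touched
        visited += 1
        if b is not None:
            found += 1
    return visited, found

# With only ~5 live entries, scan cost tracks table size, not contents:
print(scan(1024, range(5)))  # (1024, 5)
print(scan(64, range(5)))    # (64, 5)
```

Since the table grows automatically when needed, starting at 64 buckets trades nothing away for workloads that do keep many entries.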

With that, I was down to about 21s run time, and now perf says

+   23.17%    23.08%         21254  postmaster       postgres                                        [.] CallSyscacheCallbacks
+   16.97%    16.91%         15579  postmaster       postgres                                        [.] LocalExecuteInvalidationMessage
+   16.10%    16.03%         14766  postmaster       postgres                                        [.] CatCacheInvalidate
+   12.45%    12.42%         11474  postmaster       postgres                                        [.] hash_seq_search
+    8.66%     8.64%          7959  postmaster       postgres                                        [.] SysCacheInvalidate
+    4.72%     4.70%          4331  postmaster       postgres                                        [.] ReorderBufferCommit
+    3.16%     3.14%          2894  postmaster       postgres                                        [.] InvalidateCatalogSnapshot
+    2.56%     2.50%          2344  postmaster       postgres                                        [.] hash_search_with_hash_value
+    1.27%     1.27%          1169  postmaster       postgres                                        [.] RelfilenodeMapInvalidateCallback

Looking at CallSyscacheCallbacks, it's got exactly the same
disease of linearly scanning a list to find relevant entries,
when we could easily pre-index the list. The second patch
attached gets me to 13s, with profile

+   22.78%    22.70%         14568  postmaster       postgres                                     [.] CatCacheInvalidate
+   17.75%    17.69%         11355  postmaster       postgres                                     [.] LocalExecuteInvalidationMessage
+   17.11%    17.04%         10997  postmaster       postgres                                     [.] hash_seq_search
+    7.25%     7.22%          4634  postmaster       postgres                                     [.] SysCacheInvalidate
+    5.05%     5.03%          3229  postmaster       postgres                                     [.] CallSyscacheCallbacks
+    4.09%     4.08%          2615  postmaster       postgres                                     [.] ReorderBufferCommit
+    3.56%     3.55%          2276  postmaster       postgres                                     [.] InvalidateCatalogSnapshot
+    3.51%     3.40%          2240  postmaster       postgres                                     [.] hash_search_with_hash_value
+    1.76%     1.75%          1122  postmaster       postgres                                     [.] RelfilenodeMapInvalidateCallback
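The pre-indexing fix for the callback list follows the same pattern as the catcache change: group registrations by syscache id up front so an invalidation only visits relevant callbacks. A hedged Python sketch (hypothetical names; the real code keeps fixed-size C arrays, not dicts of lists):

```python
from collections import defaultdict

# Sketch: instead of linearly scanning every registered callback on
# each invalidation, bucket callbacks by syscache id at registration.
callbacks_by_cacheid = defaultdict(list)

def register_syscache_callback(cache_id, func):
    callbacks_by_cacheid[cache_id].append(func)

def call_syscache_callbacks(cache_id, hash_value):
    # Only callbacks registered for this cache id are visited.
    for func in callbacks_by_cacheid[cache_id]:
        func(cache_id, hash_value)

hits = []
register_syscache_callback(41, lambda cid, hv: hits.append((cid, hv)))
register_syscache_callback(12, lambda cid, hv: hits.append((cid, hv)))
call_syscache_callbacks(41, 12345)
print(hits)  # only the cache-id-41 callback fired
```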

We're at a point of diminishing returns here; I think any further
improvement would require reducing the number of invalidation calls,
as Andres was speculating about upthread. Still, this shows that
it doesn't take very much work to get a 10X improvement in the
overhead associated with inval activities. We've never seen this
overhead stick out quite this much before, and maybe logical replication
will always be an atypical workload, but I think this may be worth
committing even if Andres does manage to cut the number of calls.

It would be interesting to see how much these patches help for your real
use-case, as opposed to this toy example. Assuming that the results are
positive, I would advocate for back-patching these changes as far as 9.4
where logical decoding came in.

BTW, I also noticed that we're getting scarily close to exceeding
MAX_SYSCACHE_CALLBACKS. There are 30-some calls to
CacheRegisterSyscacheCallback in our code, and while I think not all of
them can be reached in a single process, we demonstrably get as high as 21
registered callbacks in some regression test runs. That's not leaving a
lot of daylight for add-on modules. The second patch attached includes
increasing MAX_SYSCACHE_CALLBACKS from 32 to 64. I think we'd be well
advised to apply and back-patch that, even if we don't use the rest of
the patch.

regards, tom lane

Attachments:

smaller-hashtable.patch (text/x-diff; +1 −1)
avoid-search-in-CallSyscacheCallbacks.patch (text/x-diff; +41 −13)
#9Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#8)
Re: Logical decoding CPU-bound w/ large number of tables

Hi Tom, Mathieu,

On 2017-05-10 17:02:11 -0400, Tom Lane wrote:

Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:

Andres, it seems like the problem is independent of having large data
manipulations mixed with schema changes. The below test case demonstrates
it with just schema changes.

Tom, I've tested your patch, and it seems to have a positive impact for
sure. I've documented a test case to reproduce this issue (below), and
your avoid-search-in-catcache-invalidate.patch reduces the test case time
from 63 seconds per run to 27 seconds per run.

Thanks for the test case.

Indeed!

+   59.28%    59.05%        150349  postmaster       postgres                                      [.] hash_seq_search
+    8.40%     8.36%         21250  postmaster       postgres                                      [.] CallSyscacheCallbacks
+    6.37%     6.34%         16115  postmaster       postgres                                      [.] LocalExecuteInvalidationMessage
+    5.69%     5.67%         14418  postmaster       postgres                                      [.] CatCacheInvalidate
+    3.14%     3.12%          7942  postmaster       postgres                                      [.] SysCacheInvalidate
+    1.72%     1.71%          4354  postmaster       postgres                                      [.] ReorderBufferCommit
+    1.37%     1.33%          3512  postmaster       postgres                                      [.] hash_search_with_hash_value
+    1.15%     1.15%          2913  postmaster       postgres                                      [.] InvalidateCatalogSnapshot

I looked at the hash_seq_search time a bit more, and realized that
actually there's a pretty easy fix for that, which is to reduce
the initial size of RelfilenodeMapHash from 1024 entries to 64.

Heh, that's not a bad idea. Given the table resizes automatically, there
seems little reason not to go there.

You could quibble about where to set that exactly, but 1024 is
just way too many --- in your example, there are never more than
5 entries in the hash, despite the presence of 10000 tables in
the database.

That's likely because there's no DML.

We're at a point of diminishing returns here; I think any further
improvement would require reducing the number of invalidation calls,
as Andres was speculating about upthread. Still, this shows that
it doesn't take very much work to get a 10X improvement in the
overhead associated with inval activities. We've never seen this
overhead stick out quite this much before, and maybe logical replication
will always be an atypical workload, but I think this may be worth
committing even if Andres does manage to cut the number of calls.

I've seen these callsites prominently in profiles not using logical
decoding. Never quite as big as here, but still. Temp table heavy
workloads IIRC are one of the easy ways to trigger it.

It would be interesting to see how much these patches help for your real
use-case, as opposed to this toy example. Assuming that the results are
positive, I would advocate for back-patching these changes as far as 9.4
where logical decoding came in.

+1.

BTW, I also noticed that we're getting scarily close to exceeding
MAX_SYSCACHE_CALLBACKS. There are 30-some calls to
CacheRegisterSyscacheCallback in our code, and while I think not all of
them can be reached in a single process, we demonstrably get as high as 21
registered callbacks in some regression test runs. That's not leaving a
lot of daylight for add-on modules. The second patch attached includes
increasing MAX_SYSCACHE_CALLBACKS from 32 to 64. I think we'd be well
advised to apply and back-patch that, even if we don't use the rest of
the patch.

+1 - I'd advocate for doing so all the way.

Greetings,

Andres Freund


#10Mathieu Fenniak
mathieu.fenniak@replicon.com
In reply to: Andres Freund (#9)
Re: Logical decoding CPU-bound w/ large number of tables

Hi Tom, Andres,

I've taken your patches, Tom, and applied them to a test deployment of my
actual application.

The most accessible way I have to reproduce this issue is to run a
maintenance task that we typically run during a software deployment, which
will remove some tenant schemas from our database system. I ran that task
with and without the patches, and monitored how long logical replication
stalled. This test was on a database with 22 schemas containing 400 tables
each (~8800 tables).

As a baseline, PostgreSQL 9.5.6 stalled completely for 21m 13s

PostgreSQL 9.5.6 + all three patches stalled completely for 4m 11s.

Definitely a fantastic improvement. :-)

This test scenario (performing the maintenance task) is unfortunately very
similar to the synthetic test... but it's the most consistent manner I've
found to reproduce this stall. I've observed the replication stall in
production at somewhat arbitrary times, so I'm not sure if there may be
different performance profiles in more production-like usage.

Here's a CPU performance sampling from the 4m period w/ the patched
PostgreSQL. This is from the `perf` tool on Linux, rather than the earlier
samples from my development machine on OSX. `perf report`:
https://gist.github.com/mfenniak/9a3484c426a497f3903051d9fdf8b221 Raw perf
data (49kb bzip2):
https://www.dropbox.com/s/3jfxtg9kjzjztmp/linux-pg-walsender-patched.perf.data.bz2?dl=0

Mathieu


#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mathieu Fenniak (#10)
Re: Logical decoding CPU-bound w/ large number of tables

Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:

I've taken your patches, Tom, and applied them to a test deployment of my
actual application.

The most accessible way I have to reproduce this issue is to run a
maintenance task that we typically run during a software deployment, which
will remove some tenant schemas from our database system. I ran that task
with and without the patches, and monitored how long logical replication
stalled. This test was on a database with 22 schemas containing 400 tables
each (~8800 tables).

As a baseline, PostgreSQL 9.5.6 stalled completely for 21m 13s

PostgreSQL 9.5.6 + all three patches stalled completely for 4m 11s.

Definitely a fantastic improvement. :-)

Thanks for confirming that these patches help in a real-world scenario.
Obviously there's still room for improvement, but I think it's worth
pushing what I've done, so I'll go do that. I hope Andres will follow
up with trying to improve matters from the other angle.

regards, tom lane
