hash_search_with_hash_value is high in "perf top" on a replica

Started by Dmitry Koterov, about 1 year ago. 13 messages.
#1 Dmitry Koterov
dmitry.koterov@gmail.com

Hi.

Debugging some replication lag on a replica when the master node
experiences heavy writes.

PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
which is a little surprising (what is it doing with all those CPU cycles?
it looked like WAL replay should be more IO bound than CPU bound?).

Running "perf top -p <pid>", it shows this:

Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value
5.38% postgres [.] __aarch64_ldset4_sync
4.42% postgres [.] __aarch64_cas4_acq_rel
3.42% postgres [.] XLogReadBufferExtended
2.35% libc.so.6 [.] 0x0000000000097f4c
2.04% postgres [.] pg_comp_crc32c_armv8
1.77% [kernel] [k] mutex_lock
1.63% postgres [.] XLogPrefetcherReadRecord
1.56% postgres [.] DecodeXLogRecord
1.55% postgres [.] LWLockAcquire

This is more surprising: hash_search_with_hash_value is a hash table lookup
function, so is it expected that it is #1 in the profiler? (Called mostly
from PrefetchSharedBuffer*.)

Configuration:
- PG17
- ZFS, compression is off, empty database (with compression on, most of
"startup recovering" CPU was spent in ZFS guts doing
compression/decompression according to the profiler)
- full_page_writes=off, recovery_prefetch=on (ZFS supports it, I tested
with a small C program), wal_decode_buffer_size=2048kB (it doesn't seem to
affect anything though).
- shared_buffers = 25% of RAM
- testing with a giant COPY command basically

My main question is about hash_search_with_hash_value
<https://github.com/postgres/postgres/blob/59d6c03956193f622c069a4ab985bade27384ac4/src/backend/utils/hash/dynahash.c#L968>
CPU usage. With both recovery_prefetch=on and off, it is the topmost
function in "perf top". I see no IO bottleneck on the machine, it's only
"startup recovering" maxing out one CPU core.

Maybe it's a red herring though, but it looks pretty suspicious.

#2 Srinath Reddy Sadipiralla
srinath2133@gmail.com
In reply to: Dmitry Koterov (#1)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On Fri, Jan 31, 2025 at 5:00 PM Dmitry Koterov <dmitry.koterov@gmail.com>
wrote:

Hi.

Debugging some replication lag on a replica when the master node
experiences heavy writes.

PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
which is a little surprising (what is it doing with all those CPU cycles?
it looked like WAL replay should be more IO bound than CPU bound?).

Running "perf top -p <pid>", it shows this:

Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value
5.38% postgres [.] __aarch64_ldset4_sync
4.42% postgres [.] __aarch64_cas4_acq_rel
3.42% postgres [.] XLogReadBufferExtended
2.35% libc.so.6 [.] 0x0000000000097f4c
2.04% postgres [.] pg_comp_crc32c_armv8
1.77% [kernel] [k] mutex_lock
1.63% postgres [.] XLogPrefetcherReadRecord
1.56% postgres [.] DecodeXLogRecord
1.55% postgres [.] LWLockAcquire

This is more surprising: hash_search_with_hash_value is a hash table
lookup function, is it expected that it is #1 in the profiler? (Called
mostly from PrefetchSharedBuffer*.)

Configuration:
- PG17
- ZFS, compression is off, empty database (with compression on, most of
"startup recovering" CPU was spent in ZFS guts doing
compression/decompression according to the profiler)
- full_page_writes=off, recovery_prefetch=on (ZFS supports it, I tested
with a small C program), wal_decode_buffer_size=2048kB (it doesn't seem to
affect anything though).
- shared_buffers = 25% of RAM
- testing with a giant COPY command basically

My main question is about hash_search_with_hash_value
<https://github.com/postgres/postgres/blob/59d6c03956193f622c069a4ab985bade27384ac4/src/backend/utils/hash/dynahash.c#L968>
CPU usage. With both recovery_prefetch=on and off, it is the topmost
function in "perf top". I see no IO bottleneck on the machine, it's only
"startup recovering" maxing out one CPU core.

Maybe it's a red herring though, but it looks pretty suspicious.

Hi,

I think the high CPU usage makes sense: hash_search_with_hash_value is called
every time the startup process on the replica reads a WAL record and tries to
redo it, because it goes through ReadBuffer_common to check whether the page
the record affects (and wants to redo) is in the buffer pool, via
BufferAlloc -> BufTableLookup -> hash_search_with_hash_value.

Regards,
Srinath Reddy Sadipiralla,
EDB: https://www.enterprisedb.com

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dmitry Koterov (#1)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On 2025-Jan-31, Dmitry Koterov wrote:

PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
which is a little surprising (what is it doing with all those CPU cycles?
it looked like WAL replay should be more IO bound than CPU bound?).

Running "perf top -p <pid>", it shows this:

Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value

Yeah, I noticed that this function was showing high in some profiles a
couple of days ago as well. Looking now at hashfn.c (hash_bytes_uint32
there is the function involved in the buffer mapping hash table), the
comments state that we're updated to Bob Jenkins' code from 2006, but
there's a version on his website that (if I read correctly) is twice as
fast as what we're using now:
http://burtleburtle.net/bob/hash/spooky.html

Apparently this code in our repo is mostly unchanged since commit
1f559b7d3aa4, in 2007.

He mentions that on Intel chips, Google's CityHash is faster; but they
in turn claim that the difference is small on Intel chips and that
Jenkins' hash is better on AMD chips.
https://github.com/google/cityhash

Anyway if you wanted to try your luck at improving things, here's your
chance.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#4 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alvaro Herrera (#3)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On 31.01.2025 17:25, Álvaro Herrera wrote:

On 2025-Jan-31, Dmitry Koterov wrote:

PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
which is a little surprising (what is it doing with all those CPU cycles?
it looked like WAL replay should be more IO bound than CPU bound?).

Running "perf top -p <pid>", it shows this:

Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value

Yeah, I noticed that this function was showing high in some profiles a
couple of days ago as well. Looking now at hashfn.c (hash_bytes_uint32
there is the function involved in the buffer mapping hash table), the
comments state that we're updated to Bob Jenkins code from 2006, but
there's a version in his website that (if I read correctly) is twice as
fast as what we're using now:
http://burtleburtle.net/bob/hash/spooky.html

Apparently this code in our repo is mostly unchanged since commit
1f559b7d3aa4, in 2007.

He mentions that on Intel chips, Google's CityHash is faster; but they
in turn claim that the difference is small on Intel chips and that
Jenkins' hash is better on AMD chips.
https://github.com/google/cityhash

Anyway if you wanted to try your luck at improving things, here's your
chance.

`hash_search_with_hash_value` uses an already-calculated hash value, so its
performance doesn't depend on the performance of the hash function
(hash_bytes_uint32 or any other).

I believe it is memory bound, since dynahash does at least three
indirection jumps before it even reaches the first node in the linked list:

    segp = hashp->dir[segment_num];
    *bucketptr = segp[segment_ndx];

and then it iterates through the linked list, each iteration being (for a
large hash table) both a cache miss and a TLB miss (without huge pages).

#5 Andres Freund
andres@anarazel.de
In reply to: Dmitry Koterov (#1)
Re: hash_search_with_hash_value is high in "perf top" on a replica

Hi,

On 2025-01-31 03:30:35 -0800, Dmitry Koterov wrote:

Debugging some replication lag on a replica when the master node
experiences heavy writes.

PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
which is a little surprising (what is it doing with all those CPU cycles?
it looked like WAL replay should be more IO bound than CPU bound?).

Running "perf top -p <pid>", it shows this:

Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value

It'd be interesting to see what the paths towards hash_search_with_hash_value
are.

You said it's a COPY workload, which surprises me a bit, because that should
normally be a bit less sensitive to it. Perhaps you have triggers or such that
prevent use of the multi-insert path?

5.38% postgres [.] __aarch64_ldset4_sync
4.42% postgres [.] __aarch64_cas4_acq_rel

These two suggest that it might be worth compiling with an -march setting for
a CPU that provides native atomics (everything from armv8.1-a up, I think).
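If it helps, a build sketch along those lines (flags and paths are illustrative; verify what your toolchain and CPU actually support before using it):

```shell
# Illustrative build fragment: target a baseline with LSE atomics so the
# compiler emits native CAS/LDADD instructions instead of calling the
# __aarch64_* outline-atomics helpers seen in the profile.
CFLAGS="-O2 -march=armv8.1-a" ./configure --prefix=/usr/local/pgsql
make -j"$(nproc)" && make install
```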

Maybe it's a red herring though, but it looks pretty suspicious.

It's unfortunately not too surprising - our buffer mapping table is a pretty
big bottleneck. Both because a hash table is just not a good fit for the
buffer mapping table due to the lack of locality and because dynahash is
really poor hash table implementation.

Greetings,

Andres Freund

#6 Dmitry Koterov
dmitry.koterov@gmail.com
In reply to: Andres Freund (#5)
Re: hash_search_with_hash_value is high in "perf top" on a replica

It'd be interesting to see what the paths towards
hash_search_with_hash_value are.

One of the popular paths is on the screenshot. They are all more or less
the same when recovery_prefetch=on (and when it's off, the replica behaves
worse, with more replication lag).

That COPY command - it's the initial sync stage after CREATE SUBSCRIPTION
basically (with streaming=off, default). Large table (hundreds of
gigabytes).

[image: image.png]

It's also interesting how gradually the disk writes and IOPS grow on the
replica during that large table COPY (dark blue line on the screenshot
below). And in the end, when "EXPLAIN SELECT 1 FROM mytable" showed ~100%
of the rows (i.e. almost everything was copied), it got stuck for ~5 minutes
with a spike in both reads and writes (the dark blue charts are all about
the replica). I.e. that COPY load is not spread uniformly: at the end, it
causes more load (I also saw that in the "startup recovering" process
CPU metrics).

[image: CleanShot 2025-02-01 at 03.42.16@2x.png]

On Fri, Jan 31, 2025 at 6:43 AM Andres Freund <andres@anarazel.de> wrote:


Hi,

On 2025-01-31 03:30:35 -0800, Dmitry Koterov wrote:

Debugging some replication lag on a replica when the master node
experiences heavy writes.

PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
which is a little surprising (what is it doing with all those CPU cycles?
it looked like WAL replay should be more IO bound than CPU bound?).

Running "perf top -p <pid>", it shows this:

Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value

It'd be interesting to see what the paths towards
hash_search_with_hash_value are.

You said it's a COPY workloads, which surprises me a bit, because that
should normally be a bit less sensitive to it. Perhaps you have triggers
or such that prevent use of the multi-insert path?

5.38% postgres [.] __aarch64_ldset4_sync
4.42% postgres [.] __aarch64_cas4_acq_rel

These two suggest that it might be worth compiling with an -march CPU that
provides native atomics (everything above armv8.1-a, I think).

Maybe it's a red herring though, but it looks pretty suspicious.

It's unfortunately not too surprising - our buffer mapping table is a
pretty big bottleneck. Both because a hash table is just not a good fit
for the buffer mapping table due to the lack of locality and because
dynahash is really poor hash table implementation.

Greetings,

Andres Freund

Attachments:

- image.png (image/png)
- CleanShot 2025-02-01 at 03.42.16@2x.png (image/png)
#7 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#5)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:

Maybe it's a red herring though, but it looks pretty suspicious.

It's unfortunately not too surprising - our buffer mapping table is a
pretty
big bottleneck. Both because a hash table is just not a good fit for the
buffer mapping table due to the lack of locality and because dynahash is
really poor hash table implementation.

I measured similar things when looking at apply throughput recently. For
in-cache workloads, buffer lookup and locking were about half of the load.

One direction is to extract more memory concurrency: the prefetcher could
batch multiple lookups together so the CPU's out-of-order execution has a
chance to fire off multiple memory accesses at the same time.

The other direction is to split off WAL decoding, buffer lookup and maybe
even pinning to a separate process from the main redo loop.

--
Ants Aasma

#8 Andres Freund
andres@anarazel.de
In reply to: Dmitry Koterov (#6)
Re: hash_search_with_hash_value is high in "perf top" on a replica

Hi,

On 2025-02-01 03:46:33 -0800, Dmitry Koterov wrote:

It'd be interesting to see what the paths towards
hash_search_with_hash_value are.

One of the popular paths is on the screenshot. They are all more or less
the same when recovery_prefetch=on (and when it's off, the replica behaves
worse, more replication lag).

Yea, I was hoping for a profile with recovery_prefetch=off, so we can see what
records are causing most of the "slow" lookups.

Greetings,

Andres Freund

#9 Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#7)
Re: hash_search_with_hash_value is high in "perf top" on a replica

Hi,

On 2025-02-01 15:43:41 +0100, Ants Aasma wrote:

On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:

Maybe it's a red herring though, but it looks pretty suspicious.

It's unfortunately not too surprising - our buffer mapping table is a
pretty
big bottleneck. Both because a hash table is just not a good fit for the
buffer mapping table due to the lack of locality and because dynahash is
really poor hash table implementation.

I measured similar things when looking at apply throughput recently. For
in-cache workloads buffer lookup and locking was about half of the load.

One other direction is to extract more memory concurrency. Prefetcher could
batch multiple lookups together so CPU OoO execution has a chance to fire
off multiple memory accesses at the same time.

I think at the moment we have a *hilariously* cache-inefficient buffer
lookup; that's the first thing to address. A hash table for buffer mapping
lookups imo is a bad idea, due to losing all locality in a workload that
exhibits a *lot* of locality. But furthermore, dynahash.c is very far from
a cache-efficient hashtable implementation.

The other aspect is that in many workloads we'll look up a small set of
buffers over and over, which a) wastes cycles b) wastes cache space for stuff
that could be elided much more efficiently.

We also do a lot of hash lookups for smgr, because we don't have any
cross-record caching infrastructure for that.

The other direction is to split off WAL decoding, buffer lookup and maybe
even pinning to a separate process from the main redo loop.

Maybe, but I think we're rather far away from those things being the most
productive thing to tackle.

Greetings,

Andres Freund

#10 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#9)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On Sat, 1 Feb 2025 at 16:55, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-02-01 15:43:41 +0100, Ants Aasma wrote:

On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:

Maybe it's a red herring though, but it looks pretty suspicious.

It's unfortunately not too surprising - our buffer mapping table is a
pretty
big bottleneck. Both because a hash table is just not a good fit for the
buffer mapping table due to the lack of locality and because dynahash is
really poor hash table implementation.

I measured similar things when looking at apply throughput recently. For
in-cache workloads buffer lookup and locking was about half of the load.

One other direction is to extract more memory concurrency. Prefetcher could
batch multiple lookups together so CPU OoO execution has a chance to fire
off multiple memory accesses at the same time.

I think at the moment we have a *hilariously* cache-inefficient buffer
lookup; that's the first thing to address. A hash table for buffer mapping
lookups imo is a bad idea, due to losing all locality in a workload that
exhibits a *lot* of locality. But furthermore, dynahash.c is very far from
a cache-efficient hashtable implementation.

In case you might be interested, I've sent a new patch with a new approach
to reducing the buffer lookup table's memory in [0], which attempts to
create a more cache-efficient hash table implementation.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: /messages/by-id/CAEze2WiRo4Zu71jwxYmqjq6XK814Avf2-kytaL6n=BreZR2ZbA@mail.gmail.com

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Ants Aasma (#7)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants.aasma@cybertec.at> wrote:

The other direction is to split off WAL decoding, buffer lookup and maybe even pinning to a separate process from the main redo loop.

Hi Ants,

FWIW I have a patch set that changes xlogprefetcher.c to use
read_stream.c, which I hope to propose for 19. It pins ahead of time,
which means that it already consolidates some pin/unpin/repin
sequences when it looks ahead, a little bit like what you're
suggesting, just not in a separate process. Here [1] is a preview, but
it needs some major rebasing and has some issues... it's on my list...

I mention this because, although the primary goal of the
recovery-streamification project is recovery I/O performance,
especially anticipating direct I/O (especially without FPW), there are
many more opportunities along those lines for cached blocks. I've
written about some of them before and posted toy sketch code, but I
didn't let myself get too distracted as the first thing seems to be
real streamification (replaying LsnReadQueue). With a bit more work it
could avoid lots of buffer mapping table lookups too: I posted a few
ideas and toy patches, one based on already held pins, which begins to
pay off once you hold many pins in a window, and one more generally
caching recently accessed buffers in smgr relations. That'd be
independent of whether our buffer mapping table also sucks and needs a
rewrite; it bypasses it completely in many interesting cases, and
would need to be so cheap as to be basically free when it doesn't
help.

An assumption I just made, in remembering all that: the OP didn't
mention it, but I guess that this COPY replay is probably repeatedly
hammering the same pages from separate records here, because otherwise
the multi-insert stuff would already avoid a lot of mapping table
lookups?

There are many more automatic batching opportunities, like holding
content locks across records (I posted a toy patch to measure the
potential speedup for that once, too lazy to search for it...), and
then to really benefit from all of these things in more general real
workloads I think you need to be able to do a little bit of
re-ordering, because e.g. heap/index accesses are so often
interleaved. IOW there seems to be plenty of lower-hanging fruit in
the "mechanical sympathy" category before you need another thread.

No argument against also making the hashing faster, the mapping table
better, and eventually parallelising some bits when it's not just
compensating for inefficient code that throws away all the locality, I
just wanted to mention that related work in the pipeline :-)

[1]: https://github.com/macdice/postgres/commit/52533b652e544464add6f174019161889a37ed93

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#11)
Re: hash_search_with_hash_value is high in "perf top" on a replica

On Wed, Feb 5, 2025 at 10:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:

(replaying LsnReadQueue)

s/replaying/replacing/

#13 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Thomas Munro (#11)
Re: hash_search_with_hash_value is high in "perf top" on a replica

Hi Thomas!

On Tue, Feb 4, 2025 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants.aasma@cybertec.at> wrote:

The other direction is to split off WAL decoding, buffer lookup and maybe even pinning to a separate process from the main redo loop.

Hi Ants,

[..]

An assumption I just made, in remembering all that: OP didn't
mention it but I guess that this COPY replay is probably repeatedly
hammering the same pages from separate records here, because otherwise
multi-insert stuff would already avoid a lot of mapping table lookups
already?

A basic COPY (he mentioned a basic CREATE SUBSCRIPTION table copy, so I
assume it's a fresh table) ends up emitting stuff like this (note that
"blk" is increasing):

rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn:
0/040069A0, prev 0/04004950, desc: MULTI_INSERT+INIT ntuples: 226,
flags: 0x00, blkref #0: rel 1663/5/16393 blk 5
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn:
0/040077F8, prev 0/040069A0, desc: MULTI_INSERT+INIT ntuples: 226,
flags: 0x00, blkref #0: rel 1663/5/16393 blk 6
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn:
0/04008668, prev 0/040077F8, desc: MULTI_INSERT+INIT ntuples: 226,
flags: 0x00, blkref #0: rel 1663/5/16393 blk 7
rmgr: Heap2 len (rec/tot): 3122/ 3122, tx: 759, lsn:
0/040094C0, prev 0/04008668, desc: MULTI_INSERT+INIT ntuples: 192,
flags: 0x02, blkref #0: rel 1663/5/16393 blk 8

Now, if the table has a PK, we end up emitting *massive* numbers of
INSERT_LEAF records due to the lack of batched btree emission like in the
heap2 case:

rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn:
0/05000028, prev 0/041C8198, desc: MULTI_INSERT+INIT ntuples: 226,
flags: 0x00, blkref #0: rel 1663/5/16396 blk 0
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn:
0/05000E80, prev 0/05000028, desc: MULTI_INSERT+INIT ntuples: 226,
flags: 0x00, blkref #0: rel 1663/5/16396 blk 1
[..]
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004050, prev 0/05003FD8, desc: INSERT_LEAF off: 1, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004090, prev 0/05004050, desc: INSERT_LEAF off: 2, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/050040D0, prev 0/05004090, desc: INSERT_LEAF off: 3, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004110, prev 0/050040D0, desc: INSERT_LEAF off: 4, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004150, prev 0/05004110, desc: INSERT_LEAF off: 5, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004190, prev 0/05004150, desc: INSERT_LEAF off: 6, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/050041D0, prev 0/05004190, desc: INSERT_LEAF off: 7, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004210, prev 0/050041D0, desc: INSERT_LEAF off: 8, blkref #0: rel
1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn:
0/05004250, prev 0/05004210, desc: INSERT_LEAF off: 9, blkref #0: rel
1663/5/16398 blk 1
[..]

I don't know exactly which situation Dmitry has hit, though the second
scenario would be much easier to optimize. This of course reminds me of
your earlier work on the recent_buffer optimization too, and Andres
mentioned some form of LRU cache to protect the hash table for buffer
mapping lookups. Also, I was under the impression that the work by
Bharath [0] could help here too, by lowering the number of WAL records
emitted.

@Dmitry: So if you're reading this, it has been a known problem for a
while (a max-performance ceiling, see [1]), but there are a myriad of
possible long-term solutions in the code. Outside of hacking on PG
changes, you could probably only split the big table on the publisher
into many smaller partitions (but not too many), and then have COPY run
on those smaller ones with some delays, so that the replication lag
doesn't grow too much. The worst alternative is trying to get the
fastest possible cores (it's a single-threaded bottleneck). As Andres
noted, you could probably try to recompile with a better -march flag
on that ARM and see how much that helps.

-J.

[0]: /messages/by-id/CALj2ACVi9eTRYR=gdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ@mail.gmail.com
[1]: /messages/by-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0@VI1PR0701MB6960.eurprd07.prod.outlook.com