What should I expect when creating many logical replication slots?

Started by Antonin Bas · about 2 years ago · 4 messages · general
#1Antonin Bas
antonin.bas@gmail.com

Hi all,

I have a use case for which I am considering using Postgres Logical
Replication, but I would like to scale up to 100 or even 200
replication slots.

I have increased max_wal_senders and max_replication_slots to 100 (also
making sure that max_connections is large enough). Things seem to be
working pretty well so far based on some PoC code I have written. Postgres
is creating a walsender process for each replication slot, as expected, and
the memory footprint of each one is around 4MB.

So I am quite happy with the way things are working, but I am a bit uneasy
about increasing these configuration values by 10-20x compared to their
defaults (both max_wal_senders and max_replication_slots default to 10).
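
For reference, this is roughly the postgresql.conf fragment from my PoC (the values are what I tested with, not a recommendation):

```ini
# Required so the WAL carries enough information for logical decoding.
wal_level = logical

# One walsender process per active replication connection (default 10).
max_wal_senders = 100

# Upper bound on the number of replication slots (default 10).
max_replication_slots = 100
```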

Is there anything I should be looking out for specifically? Is it
considered an anti-pattern to use that many replication slots and walsender
processes? And, when my database comes under heavy write load, will
walsender processes start consuming a large amount of CPU / memory (I
recognize that this is a vague question, I am still working on some
empirical testing).

Finally, I am currently using Postgres 14. Should I consider upgrading to
Postgres 15 or 16 based on my use case?

Thanks in advance for any insight on this.

Antonin

#2Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Antonin Bas (#1)
Re: What should I expect when creating many logical replication slots?

On 1/11/24 6:17 PM, Antonin Bas wrote:

[...]

The biggest issue with logical decoding (which drives logical
replication) is that every subscriber has to completely decode
everything for its publication, which can be extremely memory intensive
under certain circumstances (long-running transactions being one
potential trigger). Decoders also have to read through all WAL traffic,
regardless of what their publication is set to - everything runs off the
single WAL stream.

Note that this only applies to actually decoding - simply having a large
number of slots isn't much of an issue. Even having a large number of
subscribers that aren't consuming isn't a resource issue (though it IS
an issue for MVCC / vacuuming!) - to test you need to have all the
decoders that you expect to support.

Ultimately, I'd be concerned with trying to support 100+ slots unless
you know that your change rate isn't super high and that you don't have
long-running transactions.
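
One way to keep an eye on that is to check how much WAL each slot is holding back (a sketch; slot names are whatever you created):

```sql
-- WAL retained per slot, largest first. A slot whose consumer has
-- stalled shows an ever-growing value and also blocks vacuum from
-- removing old row versions.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```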
--
Jim Nasby, Data Architect, Austin TX

#3Antonin Bas
antonin.bas@gmail.com
In reply to: Jim Nasby (#2)
Re: What should I expect when creating many logical replication slots?

Hi Jim. Thanks for taking the time to reply. Please see below.

On Tue, Jan 16, 2024 at 10:51 AM, Jim Nasby <jim.nasby@gmail.com> wrote:

[...]

The biggest issue with logical decoding (which drives logical
replication) is that every subscriber has to completely decode
everything for its publication, which can be extremely memory intensive
under certain circumstances (long-running transactions being one
potential trigger). Decoders also have to read through all WAL traffic,
regardless of what their publication is set to - everything runs off the
single WAL stream.

That seems to be the biggest issue for me.
I wanted to create a publication for a single table with a low change rate.
But based on what you are describing, it sounds like a very high
transaction rate on other tables (not part of the publication) will still
affect the resource consumption of the walsender processes, which have to
decode the unrelated WAL traffic. Am I understanding correctly?
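
For anyone else testing this, the per-slot statistics view added in Postgres 14 seems to show exactly this effect (a sketch):

```sql
-- pg_stat_replication_slots (PG 14+): cumulative decoding work per slot.
-- total_bytes counts all decoded transaction data, including changes to
-- tables that the publication filters out; spill_* shows how much of it
-- overflowed logical_decoding_work_mem to disk.
SELECT slot_name, total_txns, total_bytes, spill_txns, spill_bytes
FROM pg_stat_replication_slots;
```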


#4Klaus Darilion
klaus.mailinglists@pernau.at
In reply to: Jim Nasby (#2)
Re: What should I expect when creating many logical replication slots?

On 2024-01-16 19:51, Jim Nasby wrote:

[...]

The biggest issue with logical decoding (which drives logical
replication) is that every subscriber has to completely decode
everything for its publication, which can be extremely memory
intensive under certain circumstances (long-running transactions being
one potential trigger). Decoders also have to read through all WAL
traffic, regardless of what their publication is set to - everything
runs off the single WAL stream.

Indeed, big/long-running transactions can be an issue. Today I added a
column with SERIAL to a table with 55 million rows. Although that table is
not in the publication, the huge transaction got written to the WAL and
spilled to disk by the WAL senders. With 50 WAL senders, we basically
created a DoS amplification attack against our own disk. Luckily we could
increase the disk size.*
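
The spill threshold is logical_decoding_work_mem (64MB by default) per walsender, so a large transaction can be spilled to disk once per slot. On PG 14+ a subscription can instead stream large in-progress transactions to the subscriber, avoiding the publisher-side spill; a sketch ("my_sub" is a hypothetical subscription name):

```sql
-- PG 14+: stream large in-progress transactions to the subscriber
-- instead of spilling them to disk on the publisher.
ALTER SUBSCRIPTION my_sub SET (streaming = on);
```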

Besides that, our DB has (in my view) plenty of UPDATEs/INSERTs and does
not have replication performance problems with the current 50
replication slots. But if that were to become a bottleneck, we would use
cascading replication, i.e. the master uses logical replication to 2
"distribution" servers, which further replicate to the 50+ subscribers.
That way, the in-between "distribution" servers would also filter out WAL
decoding for changes that are not part of the publication.
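
A minimal sketch of that topology, with hypothetical names (the distribution server subscribes to the master and republishes the same table to the leaves):

```sql
-- On the master:
CREATE PUBLICATION master_pub FOR TABLE my_table;

-- On a "distribution" server (the table must already exist there):
CREATE SUBSCRIPTION dist_sub
  CONNECTION 'host=master dbname=mydb user=repl'
  PUBLICATION master_pub;

-- Republish from the distribution server to the 50+ leaves:
CREATE PUBLICATION dist_pub FOR TABLE my_table;
-- each leaf then runs CREATE SUBSCRIPTION ... PUBLICATION dist_pub;
```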

regards
Klaus

* Does somebody know why the WAL sender has to track transactions that
make changes to tables which are not part of the publication?