Slow synchronous logical replication

Started by Konstantin Knizhnik about 8 years ago · 13 messages
#1 Konstantin Knizhnik
konstantin knizhnik
k.knizhnik@postgrespro.ru

In our sharded cluster project we are trying to use logical replication to provide HA (maintaining redundant shard copies).
Asynchronous logical replication does not make much sense in the context of HA, which is why we are trying to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and redundancy level 1 (just one copy) the cluster is 20 times slower than without logical replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the synchronous replication mechanism was originally developed for streaming replication, where all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. A commit has to wait for confirmations from all receivers mentioned in the "synchronous_standby_names" list, so we are waiting not only for our own single logical replication standby, but for all other standbys as well. The number of synchronous standbys is equal to the number of shards divided by the number of nodes. To provide uniform distribution the number of shards should be much larger than the number of nodes; for example, for 10 nodes we usually create 100 shards. As a result we get awful performance, and blocking of any single replication channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical replication cannot be used efficiently in this manner.
If so, the next question is how difficult it would be to make the synchronous replication mechanism more efficient for logical replication, and whether there are plans to work in this direction.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andres Freund
Andres Freund
andres@anarazel.de
In reply to: konstantin knizhnik (#1)
Re: Slow synchronous logical replication

Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:

In our sharded cluster project we are trying to use logical replication to provide HA (maintaining redundant shard copies).
Asynchronous logical replication does not make much sense in the context of HA, which is why we are trying to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and redundancy level 1 (just one copy) the cluster is 20 times slower than without logical replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the synchronous replication mechanism was originally developed for streaming replication, where all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. A commit has to wait for confirmations from all receivers mentioned in the "synchronous_standby_names" list, so we are waiting not only for our own single logical replication standby, but for all other standbys as well. The number of synchronous standbys is equal to the number of shards divided by the number of nodes. To provide uniform distribution the number of shards should be much larger than the number of nodes; for example, for 10 nodes we usually create 100 shards. As a result we get awful performance, and blocking of any single replication channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical replication cannot be used efficiently in this manner.
If so, the next question is how difficult it would be to make the synchronous replication mechanism more efficient for logical replication, and whether there are plans to work in this direction.

This seems to be a question that is a) about a commercial project we
don't know much about, and b) hasn't received a lot of investigation.

Greetings,

Andres Freund


#3 Konstantin Knizhnik
Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#2)
Re: Slow synchronous logical replication

On 10/07/2017 10:42 PM, Andres Freund wrote:

Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:

In our sharded cluster project we are trying to use logical replication to provide HA (maintaining redundant shard copies).
Asynchronous logical replication does not make much sense in the context of HA, which is why we are trying to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and redundancy level 1 (just one copy) the cluster is 20 times slower than without logical replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the synchronous replication mechanism was originally developed for streaming replication, where all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. A commit has to wait for confirmations from all receivers mentioned in the "synchronous_standby_names" list, so we are waiting not only for our own single logical replication standby, but for all other standbys as well. The number of synchronous standbys is equal to the number of shards divided by the number of nodes. To provide uniform distribution the number of shards should be much larger than the number of nodes; for example, for 10 nodes we usually create 100 shards. As a result we get awful performance, and blocking of any single replication channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical replication cannot be used efficiently in this manner.
If so, the next question is how difficult it would be to make the synchronous replication mechanism more efficient for logical replication, and whether there are plans to work in this direction.

This seems to be a question that is a) about a commercial project we
don't know much about, and b) hasn't received a lot of investigation.

Sorry if I was not clear.
The question was about the logical replication mechanism in the mainstream version of Postgres.
I think that most people are using asynchronous logical replication, and synchronous LR is something exotic, not well tested and investigated.
It will be great if I am wrong :)

Concerning our sharded cluster (pg_shardman): it is not a commercial product yet, it is in the development phase.
We are going to open its sources when it is more or less stable.
But unlike multimaster, this sharded cluster is mostly built from existing components: pg_pathman + postgres_fdw + logical replication.
So we are just trying to combine them all into an integrated system.
But currently the most obscure point is logical replication.

The main goal of my e-mail was to learn the opinion of the authors and users of LR on whether it is a good idea to use LR to provide fault tolerance in a sharded cluster,
or whether some other approach, for example sharding with redundancy or streaming replication, is preferable.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#4 Craig Ringer
Craig Ringer
craig@2ndquadrant.com
In reply to: Konstantin Knizhnik (#3)
Re: Slow synchronous logical replication

On 8 October 2017 at 03:58, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

The question was about logical replication mechanism in mainstream version
of Postgres.

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

I think that most people are using asynchronous logical replication and
synchronous LR is something exotic and not well tested and investigated.
It will be great if I am wrong :)

I doubt it's widely used. That said, a lot of people use synchronous
replication with BDR and pglogical, which are ancestors of the core
logical rep code and design.

I think you actually need to collect some proper timings and
diagnostics here, rather than hand-waving about it being "slow". A
good starting point might be setting some custom 'perf' tracepoints,
or adding some 'elog()'ing for timestamps. Then scrape the results and
build a latency graph.

That said, if I had to guess why it's slow, I'd say that you're facing
a number of factors:

* By default, logical replication in PostgreSQL does not do an
immediate flush to disk after downstream commit. In the interests of
faster apply performance it instead delays sending flush confirmations
until the next time WAL is flushed out. See the docs for CREATE
SUBSCRIPTION, notably the synchronous_commit option. This will
obviously greatly increase latencies on sync commit.

* Logical decoding doesn't *start* streaming a transaction until the
origin node finishes the xact and writes a COMMIT, then the xlogreader
picks it up.

* As a consequence of the above, a big xact holds up commit
confirmations of smaller ones by a LOT more than is the case for
streaming physical replication.
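For the first point, the per-subscription knob mentioned above can be set at subscription creation time. A minimal sketch (subscription, publication, and connection-string names are placeholders):

```sql
-- On the subscriber: ask the apply worker to flush immediately after
-- applying a commit, so the publisher's synchronous commit is confirmed
-- promptly instead of waiting for the next WAL flush cycle.
CREATE SUBSCRIPTION shard1_sub
    CONNECTION 'host=node1 port=5432 dbname=postgres'
    PUBLICATION shard1_pub
    WITH (copy_data = false, synchronous_commit = 'on');
```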

Hopefully that gives you something to look into, anyway. Maybe you'll
be inspired to work on parallelized logical decoding :)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#5 Konstantin Knizhnik
Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Craig Ringer (#4)
Re: Slow synchronous logical replication

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

It will not be so easy to provide a reproducing scenario, because it
actually involves many components (postgres_fdw, pg_pathman,
pg_shardman, LR, ...) and requires a multinode installation.
But let me try to explain what is going on.
We have implemented sharding - splitting data between several remote
tables using pg_pathman and postgres_fdw.
It means that an insert or update of the parent table causes an insert or
update of some derived partition, which is forwarded by postgres_fdw to
the corresponding node.
The number of shards is significantly larger than the number of nodes, i.e.
for 5 nodes we have 50 shards, which means that on each node we have 10 shards.
To provide fault tolerance each shard is replicated using logical
replication to one or more nodes. Right now we considered only
redundancy level 1 - each shard has only one replica.
So from each node we establish 10 logical replication channels.

We want commit to wait until data is actually stored at all replicas, so
we are using synchronous replication:
we set the synchronous_commit option to "on" and include all 10
subscriptions in the synchronous_standby_names list.
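The setup described above corresponds to something like the following publisher configuration (a sketch; the subscription names are illustrative, and the FIRST 10 form makes the commit wait for confirmation from all ten listed standbys):

```
# postgresql.conf on each node
synchronous_commit = on
# wait for confirmation from every one of the ten subscriptions
# originating on this node
synchronous_standby_names = 'FIRST 10 (sub_shard_1, sub_shard_2, sub_shard_3, sub_shard_4, sub_shard_5, sub_shard_6, sub_shard_7, sub_shard_8, sub_shard_9, sub_shard_10)'
```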

In this setup commit latency is very large (about 100 msec, and most of
the time is actually spent in commit) and performance is very bad -
pgbench shows about 300 TPS for the optimal number of clients (about 10;
for larger numbers performance is almost the same). Without logical
replication in the same setup we get about 6000 TPS.

I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr
function. Each wal sender independently calculates the minimal LSN among
all synchronous replicas and wakes up backends waiting for this LSN. It
means that a transaction performing an update of data in one shard will
actually wait for confirmation from the replication channels for all shards.
If some shard is updated more rarely than others, or is not updated at all
(for example because the communication channel to its node is broken),
then all backends will be stuck.
Also all backends are competing for the single SyncRepLock, which
can also be a contention point.
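The effect is visible from the publisher side: with all subscriptions marked synchronous, a waiting backend is released only once the minimum flush position across all of these walsenders passes its commit LSN. A sketch using the standard statistics view:

```sql
-- Each row is one walsender; the smallest flush_lsn here gates every
-- waiting backend, regardless of which shard its transaction touched.
SELECT application_name, sync_state, sent_lsn, flush_lsn
FROM pg_stat_replication
ORDER BY flush_lsn;
```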

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#6 Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Konstantin Knizhnik (#5)
Re: Slow synchronous logical replication

On Mon, Oct 9, 2017 at 4:37 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

It will not be so easy to provide a reproducing scenario, because it
actually involves many components (postgres_fdw, pg_pathman,
pg_shardman, LR, ...) and requires a multinode installation.
But let me try to explain what is going on.
We have implemented sharding - splitting data between several remote tables
using pg_pathman and postgres_fdw.
It means that an insert or update of the parent table causes an insert or
update of some derived partition, which is forwarded by postgres_fdw to the
corresponding node.
The number of shards is significantly larger than the number of nodes, i.e.
for 5 nodes we have 50 shards, which means that on each node we have 10 shards.
To provide fault tolerance each shard is replicated using logical
replication to one or more nodes. Right now we considered only redundancy
level 1 - each shard has only one replica.
So from each node we establish 10 logical replication channels.

We want commit to wait until data is actually stored at all replicas, so we
are using synchronous replication:
we set the synchronous_commit option to "on" and include all 10
subscriptions in the synchronous_standby_names list.

In this setup commit latency is very large (about 100 msec, and most of the
time is actually spent in commit) and performance is very bad - pgbench
shows about 300 TPS for the optimal number of clients (about 10; for larger
numbers performance is almost the same). Without logical replication in the
same setup we get about 6000 TPS.

I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr
function. Each wal sender independently calculates the minimal LSN among all
synchronous replicas and wakes up backends waiting for this LSN. It means that
a transaction performing an update of data in one shard will actually wait for
confirmation from the replication channels for all shards.
If some shard is updated more rarely than others, or is not updated at all (for
example because the communication channel to its node is broken), then
all backends will be stuck.
Also all backends are competing for the single SyncRepLock, which can also
be a contention point.

IIUC, I guess you meant to say that in current synchronous logical
replication a transaction has to wait for the updated table data to be
replicated even to servers that don't subscribe to the table. If we
change it so that a transaction needs to wait only for the servers that
are subscribing to the table, it would be more efficient, at least for
your use case.
We send at least the begin and commit data to all subscriptions and
then wait for the reply from them, but can we skip waiting for them, for
example, when the walsender actually didn't send any data modified by
the transaction?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#7 Craig Ringer
Craig Ringer
craig@2ndquadrant.com
In reply to: Konstantin Knizhnik (#5)
Re: Slow synchronous logical replication

On 9 October 2017 at 15:37, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

It will not be so easy to provide a reproducing scenario, because it
actually involves many components (postgres_fdw, pg_pathman,
pg_shardman, LR, ...)

So simplify it to a test case that doesn't.

I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr function.
Each wal sender independently calculates the minimal LSN among all synchronous
replicas and wakes up backends waiting for this LSN. It means that a transaction
performing an update of data in one shard will actually wait for confirmation from
the replication channels for all shards.

That's expected for the current sync rep design, yes. Because it's
based on lsn, and was designed for physical rep where there's no
question about whether we're sending some data to some peers and not
others.

So all backends will wait for the slowest-responding peer, including
peers that don't need to actually do anything for this xact. You could
possibly hack around that by having the output plugin advance the slot
position when it sees that it just processed an empty xact.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#5)
Re: Slow synchronous logical replication

Hi,

On 2017-10-09 10:37:01 +0300, Konstantin Knizhnik wrote:

We have implemented sharding - splitting data between several remote tables
using pg_pathman and postgres_fdw.
It means that an insert or update of the parent table causes an insert or
update of some derived partition, which is forwarded by postgres_fdw to the
corresponding node.
The number of shards is significantly larger than the number of nodes, i.e.
for 5 nodes we have 50 shards, which means that on each node we have 10 shards.
To provide fault tolerance each shard is replicated using logical
replication to one or more nodes. Right now we considered only redundancy
level 1 - each shard has only one replica.
So from each node we establish 10 logical replication channels.

Isn't that part of the pretty fundamental problem? There shouldn't be 10
different replication channels per node. There should be one.

Greetings,

Andres Freund


#9 Konstantin Knizhnik
Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Craig Ringer (#7)
Re: Slow synchronous logical replication

On 11.10.2017 10:07, Craig Ringer wrote:

On 9 October 2017 at 15:37, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

It will not be so easy to provide a reproducing scenario, because it
actually involves many components (postgres_fdw, pg_pathman,
pg_shardman, LR, ...)

So simplify it to a test case that doesn't.

The simplest reproducing scenario is the following:
1. Start two Postgres instances with synchronous_commit=on, fsync=off.
2. Initialize the pgbench database at both instances: pgbench -i
3. Create a publication for the pgbench_accounts table at one node.
4. Create the corresponding subscription at the other node with the
copy_data=false parameter.
5. Add the subscription to synchronous_standby_names at the first node.
6. Start pgbench -c 8 -N -T 100 -P 1 at the first node. On my system
the results are the following:
standalone postgres: 8600 TPS
asynchronous replication: 6600 TPS
synchronous replication: 5600 TPS
Quite good results.
7. Create some dummy table and perform a bulk insert into it:
create table dummy(x integer primary key);
insert into dummy values (generate_series(1,10000000));

pgbench is almost stuck: until the end of the insert, performance drops
almost to zero.
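The steps above can be sketched as follows (connection strings and object names are illustrative; pgbench is driven separately as described):

```sql
-- Node 1 (publisher):
CREATE PUBLICATION pgbench_pub FOR TABLE pgbench_accounts;

-- Node 2 (subscriber):
CREATE SUBSCRIPTION pgbench_sub
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pgbench_pub
    WITH (copy_data = false);

-- Node 1: make the subscription synchronous (the subscription name is
-- used as the walsender's application_name by default).
ALTER SYSTEM SET synchronous_standby_names = 'pgbench_sub';
SELECT pg_reload_conf();

-- Node 1, while pgbench is running: a large transaction on an
-- unrelated, unpublished table stalls the synchronous commits.
CREATE TABLE dummy(x integer PRIMARY KEY);
INSERT INTO dummy SELECT generate_series(1, 10000000);
```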

The reason for such behavior is obvious: the wal sender has to decode the
huge transaction generated by the insert, although it has no relation to
this publication.
Filtering of the insert records of this transaction is done only inside
the output plug-in.
Unfortunately it is not quite clear how to make the wal sender smarter and
let it skip transactions not affecting its publication.
One of the possible solutions is to let the backend inform the wal sender
about the smallest LSN it should wait for (the backend knows which table is
affected by the current operation, hence which publications are interested
in this operation, and so can point the wal sender to the proper LSN without
decoding a huge part of the WAL).
But it seems to be not so easy to implement.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#10 Craig Ringer
Craig Ringer
craig@2ndquadrant.com
In reply to: Konstantin Knizhnik (#9)
Re: Slow synchronous logical replication

On 12 October 2017 at 00:57, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

The reason for such behavior is obvious: the wal sender has to decode the huge
transaction generated by the insert, although it has no relation to this
publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction? Because
that's a known pathological case for logical replication, it has to do
a LOT of extra work when it's in a transaction that has done DDL. I'm
sure there's room for optimisation there, but the general
recommendation for now is "don't do that".

Filtering of the insert records of this transaction is done only inside the
output plug-in.

Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.

I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.

Unfortunately it is not quite clear how to make the wal sender smarter and let
it skip transactions not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.

It's not so simple by table since we don't know early enough whether
the xact affects tables of interest or not. But you could definitely
do some selective skipping. Making it efficient could be the
challenge.

One of the possible solutions is to let the backend inform the wal sender about
the smallest LSN it should wait for (the backend knows which table is affected
by the current operation, hence which publications are interested in this
operation, and so can point the wal sender to the proper LSN without decoding
a huge part of the WAL).
But it seems to be not so easy to implement.

Sounds like confusing layering violations to me.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Konstantin Knizhnik
Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Craig Ringer (#10)
2 attachment(s)
Re: Slow synchronous logical replication

On 12.10.2017 04:23, Craig Ringer wrote:

On 12 October 2017 at 00:57, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

The reason for such behavior is obvious: the wal sender has to decode the huge
transaction generated by the insert, although it has no relation to this
publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction?

No. The table was created in a separate transaction.
Moreover, the same effect takes place if the table is created before the
start of replication.
The problem in this case seems to be caused by spilling the decoded
transaction to a file by ReorderBufferSerializeTXN.
Attached please find two profiles: lr1.svg corresponds to normal operation
of pgbench with synchronous replication to one replica;
lr2.svg - the same with concurrent execution of a huge insert statement.

And here is output of pgbench (at fifth second insert is started):

progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat 1111.578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361

It seems to be an effect of large transactions.
The presence of several channels of synchronous logical replication reduces
performance, but not by that much.
Below are results on another machine, with pgbench scale 10:

Configuration               TPS
standalone                  15k
1 async logical replica     13k
1 sync logical replica      10k
3 async logical replicas    13k
3 sync logical replicas      8k

Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.

Yes, I know about the origin filtering mechanism (and we are using it in
multimaster).
But I am speaking about the standard pgoutput.c output plugin: its
pgoutput_origin_filter always returns false.

I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.

Unfortunately it is not quite clear how to make the wal sender smarter and let
it skip transactions not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.

The problem is that before the end of a transaction we do not know whether
it touches this publication or not.
So filtering by origin will not work in this case.

I am really not sure that it is possible to skip over WAL. But the
particular problem with invalidation records etc. can be solved by
always processing these records in the WAL sender.
I.e. if a backend is inserting an invalidation record or some other record
which should always be processed by the WAL sender, it can always promote
the LSN of this record to the WAL sender.
So the WAL sender will skip only those WAL records which are safe to skip
(insert/update/delete records not affecting this publication).

I wonder if there can be some other problems with skipping part of a
transaction in the WAL sender.

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

lr2.svg (image/svg+xml)
lr1.svg (image/svg+xml)
#12 Konstantin Knizhnik
Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Craig Ringer (#10)
Re: Slow synchronous logical replication

On 12.10.2017 04:23, Craig Ringer wrote:

On 12 October 2017 at 00:57, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

The reason for such behavior is obvious: the wal sender has to decode the huge
transaction generated by the insert, although it has no relation to this
publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction?

No. The table was created in a separate transaction.
Moreover, the same effect takes place if the table is created before the
start of replication.
The problem in this case seems to be caused by spilling the decoded
transaction to a file by ReorderBufferSerializeTXN.
Please look at two profiles:
http://garret.ru/lr1.svg corresponds to normal operation of pgbench with
synchronous replication to one replica,
http://garret.ru/lr2.svg - the same with concurrent execution of a huge
insert statement.

And here is output of pgbench (at fifth second insert is started):

progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat 1111.578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361

It seems to be an effect of large transactions.
The presence of several channels of synchronous logical replication reduces
performance, but not by that much.
Below are results on another machine, with pgbench scale 10:

Configuration               TPS
standalone                  15k
1 async logical replica     13k
1 sync logical replica      10k
3 async logical replicas    13k
3 sync logical replicas      8k

Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.

Yes, I know about the origin filtering mechanism (and we are using it in
multimaster).
But I am speaking about the standard pgoutput.c output plugin: its
pgoutput_origin_filter always returns false.

I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.

Unfortunately it is not quite clear how to make the wal sender smarter and let
it skip transactions not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.

The problem is that before the end of a transaction we do not know whether
it touches this publication or not.
So filtering by origin will not work in this case.

I am really not sure that it is possible to skip over WAL. But the
particular problem with invalidation records etc. can be solved by
always processing these records in the WAL sender.
I.e. if a backend is inserting an invalidation record or some other record
which should always be processed by the WAL sender, it can always promote
the LSN of this record to the WAL sender.
So the WAL sender will skip only those WAL records which are safe to skip
(insert/update/delete records not affecting this publication).

I wonder if there can be some other problems with skipping part of a
transaction in the WAL sender.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#13 Craig Ringer
Craig Ringer
craig@2ndquadrant.com
In reply to: Konstantin Knizhnik (#11)
Re: Slow synchronous logical replication

On 12 October 2017 at 16:09, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Is the CREATE TABLE and INSERT done in the same transaction?

No. The table was created in a separate transaction.
Moreover, the same effect takes place if the table is created before the start of replication.
The problem in this case seems to be caused by spilling the decoded transaction to a file by ReorderBufferSerializeTXN.

Yeah. That's known to perform sub-optimally, and it also uses way more
memory than it should.

Your design compounds that by spilling transactions it will then
discard, and doing so multiple times.

To make your design viable you likely need some kind of cache of
serialized reorder buffer transactions, where you don't rebuild one if
it's already been generated. And likely a fair bit of optimisation on
the serialisation.

Or you might want a table- and even a row-filter that can be run
during decoding, before appending to the ReorderBuffer, to let you
skip changes early. Right now this can only be done at the transaction
level, based on replication origin. Of course, if you do this you
can't do the caching thing.

Unfortunately it is not quite clear how to make the wal sender smarter and let
it skip transactions not affecting its publication.

You'd need more hooks to be implemented by the output plugin.

I am really not sure that it is possible to skip over WAL. But the particular problem with invalidation records etc. can be solved by always processing these records in the WAL sender.
I.e. if a backend is inserting an invalidation record or some other record which should always be processed by the WAL sender, it can always promote the LSN of this record to the WAL sender.
So the WAL sender will skip only those WAL records which are safe to skip (insert/update/delete records not affecting this publication).

That sounds like a giant layering violation too.

I suggest focusing on reducing the amount of work done when reading
WAL, not trying to jump over whole ranges of WAL.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
