Multi-Master Logical Replication

Started by Peter Smith · almost 4 years ago · 35 messages · pgsql-hackers
#1 Peter Smith
smithpb2250@gmail.com

MULTI-MASTER LOGICAL REPLICATION

1.0 BACKGROUND

Let’s assume that a user wishes to set up a multi-master environment
so that a set of PostgreSQL instances (nodes) use logical replication
to share tables with every other node in the set.

We define this as a multi-master logical replication (MMLR) node-set.

<please refer to the attached node-set diagram>

1.1 ADVANTAGES OF MMLR

- Increases write scalability (e.g., all nodes can write arbitrary data).
- Allows load balancing
- Allows rolling updates of nodes (e.g., logical replication works
between different major versions of PostgreSQL).
- Improves the availability of the system (e.g., no single point of failure)
- Improves performance (e.g., lower latencies for geographically local nodes)

2.0 MMLR AND POSTGRESQL

It is already possible to configure a kind of MMLR set in PostgreSQL
15 using PUB/SUB, but it is very restrictive because it can only work
when no two nodes operate on the same table. This is because when two
nodes try to share the same table, a circular recursive problem
arises: Node1 replicates data to Node2, which is then replicated back
to Node1, and so on.

To prevent the circular recursive problem Vignesh is developing a
patch [1] that introduces new SUBSCRIPTION options "local_only" (for
publishing only data originating at the publisher node) and
"copy_data=force". Using this patch, we have created a script [2]
demonstrating how to set up all the above multi-node examples. An
overview of the necessary steps is given in the next section.

2.1 STEPS – Adding a new node N to an existing node-set

step 1. Prerequisites – Apply Vignesh’s patch [1]. All nodes in the
set must be visible to each other by a known CONNECTION. All shared
tables must already be defined on all nodes.

step 2. On node N do CREATE PUBLICATION pub_N FOR ALL TABLES

step 3. All other nodes then CREATE SUBSCRIPTION to PUBLICATION pub_N
with "local_only=on, copy_data=on" (this will replicate initial data
from the node N tables to every other node).

step 4. On node N, temporarily ALTER PUBLICATION pub_N to prevent
replication of 'truncate', then TRUNCATE all tables of node N, then
re-allow replication of 'truncate'.

step 5. On node N do CREATE SUBSCRIPTION to the publications of all
other nodes in the set
5a. Specify "local_only=on, copy_data=force" for exactly one of the
subscriptions (this will make the node N tables now have the same
data as the other nodes)
5b. Specify "local_only=on, copy_data=off" for all other subscriptions.

step 6. Result - Now changes to any table on any node should be
replicated to every other node in the set.

Note: Steps 4 and 5 need to be done within the same transaction to
avoid loss of data in case of some command failure. (Because we can't
perform create subscription in a transaction, we need to create the
subscription in a disabled mode first and then enable it in the
transaction).
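
The steps above can be sketched in SQL from the point of view of a new
node N joining an existing set {A, B}. This is only a sketch: the
connection strings and object names are illustrative, the "local_only"
and "copy_data = force" options come from the proposed patch [1], and
per the Note above the TRUNCATE and subscription creation would really
be coordinated in one transaction with initially disabled subscriptions.

```sql
-- step 2, on node N:
CREATE PUBLICATION pub_N FOR ALL TABLES;

-- step 3, on node A (and likewise on node B):
CREATE SUBSCRIPTION sub_A_N
    CONNECTION 'dbname=foo host=nodeN user=repuser'
    PUBLICATION pub_N WITH (local_only = on, copy_data = on);

-- step 4, on node N: truncate without replicating the truncate
ALTER PUBLICATION pub_N SET (publish = 'insert, update, delete');
TRUNCATE t1, t2;  -- all the shared tables
ALTER PUBLICATION pub_N SET (publish = 'insert, update, delete, truncate');

-- step 5, on node N: exactly one subscription uses copy_data = force
CREATE SUBSCRIPTION sub_N_A
    CONNECTION 'dbname=foo host=nodeA user=repuser'
    PUBLICATION pub_A WITH (local_only = on, copy_data = force);
CREATE SUBSCRIPTION sub_N_B
    CONNECTION 'dbname=foo host=nodeB user=repuser'
    PUBLICATION pub_B WITH (local_only = on, copy_data = off);
```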

2.2 DIFFICULTIES

Notice that it becomes increasingly complex to configure MMLR manually
as the number of nodes in the set increases. There are also some
difficulties such as
- dealing with initial table data
- coordinating the timing to avoid concurrent updates
- getting the SUBSCRIPTION options for copy_data exactly right.

3.0 PROPOSAL

To make the MMLR setup simpler, we propose to create a new API that
will hide all the step details and remove from the user the burden of
getting every step exactly right.

3.1 MOTIVATION
- MMLR (sharing the same tables) is not currently possible
- Vignesh's patch [1] makes MMLR possible, but the manual setup is
still quite difficult
- An MMLR implementation can solve the timing problems (e.g., using
Database Locking)

3.2 API

Preferably the API would be implemented as new SQL functions in
PostgreSQL core, however, implementation using a contrib module or
some new SQL syntax may also be possible.

SQL functions will be like below:
- pg_mmlr_set_create = create a new set, and give it a name
- pg_mmlr_node_attach = attach the current node to a specified set
- pg_mmlr_node_detach = detach a specified node from a specified set
- pg_mmlr_set_delete = delete a specified set

For example, internally the pg_mmlr_node_attach API function would
execute the equivalent of all the CREATE PUBLICATION, CREATE
SUBSCRIPTION, and TRUNCATE steps described above.
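
For illustration, usage might look like the following; the argument
lists are not specified in this proposal, so the signatures shown here
are purely hypothetical:

```sql
SELECT pg_mmlr_set_create('myset');           -- on the first node
SELECT pg_mmlr_node_attach('myset');          -- on each node joining the set
SELECT pg_mmlr_node_detach('myset', 'node3'); -- remove one node
SELECT pg_mmlr_set_delete('myset');           -- tear the whole set down
```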

Notice this proposal has some external API similarities with the BDR
extension [3] (which also provides multi-master logical replication),
although we plan to implement it entirely using PostgreSQL’s PUB/SUB.

4.0 ACKNOWLEDGEMENTS

The following people have contributed to this proposal – Hayato
Kuroda, Vignesh C, Peter Smith, Amit Kapila.

5.0 REFERENCES

[1]: /messages/by-id/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9=9orqubhjcQ@mail.gmail.com
[2]: /messages/by-id/CAHut+PvY2P=UL-X6maMA5QxFKdcdciRRCKDH3j=_hO8u2OyRYg@mail.gmail.com
[3]: https://www.enterprisedb.com/docs/bdr/latest/

[END]

~~~

One of my colleagues will post more detailed information later.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

node-sets.PNG (image/png)
#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Peter Smith (#1)
Re: Multi-Master Logical Replication

On Thu, 2022-04-28 at 09:49 +1000, Peter Smith wrote:

To prevent the circular recursive problem Vignesh is developing a
patch [1] that introduces new SUBSCRIPTION options "local_only" (for
publishing only data originating at the publisher node) and
"copy_data=force". Using this patch, we have created a script [2]
demonstrating how to set up all the above multi-node examples. An
overview of the necessary steps is given in the next section.

I am missing a discussion of how replication conflicts are handled to
prevent replication from breaking or the databases from drifting apart.

Yours,
Laurenz Albe

#3 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Laurenz Albe (#2)
RE: Multi-Master Logical Replication

Dear Laurenz,

Thank you for your interest in our work!

I am missing a discussion of how replication conflicts are handled to
prevent replication from breaking or the databases from drifting apart.

Actually, we don't have plans to develop a feature that avoids conflicts.
We think that it should be done as a core PUB/SUB feature, and
this module will just use that.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#4 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#3)
Re: Multi-Master Logical Replication

On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:

1.1 ADVANTAGES OF MMLR

- Increases write scalability (e.g., all nodes can write arbitrary data).

I've never heard that transaction-aware multimaster increases
write scalability. Moreover, usually even non-transactional
multimaster doesn't increase write scalability. At best it
doesn't decrease it.

That is because all hosts have to write all changes anyway, while
the side costs increase due to increased network interchange,
interlocking (for transaction-aware MM), and increased latency.

On Thu, 28/04/2022 at 08:34 +0000, kuroda.hayato@fujitsu.com wrote:

Dear Laurenz,

Thank you for your interest in our work!

I am missing a discussion of how replication conflicts are handled to
prevent replication from breaking

Actually, we don't have plans to develop a feature that avoids conflicts.
We think that it should be done as a core PUB/SUB feature, and
this module will just use that.

If you really want to have some proper isolation levels (
Read Committed? Repeatable Read?) and/or want to have
same data on each "master", there is no easy way. If you
think it will be "easy", you are already wrong.

Our company has a MultiMaster product which is built on top of
logical replication. It is even partially open source
( https://github.com/postgrespro/mmts ), although some
core patches that it requires are not up to
date.

And it is the second iteration of MM. The first iteration was
not "simple" or "easy" either. But even that version had
a hidden bug: rare but accumulating data differences
between nodes. The attempt to fix this bug led to an almost
full rewrite of the multi-master.

(Disclaimer: I had no involvement in either MM version;
I just work at the same company).

regards

---------

Yura Sokolov

#5 vignesh C
vignesh21@gmail.com
In reply to: Yura Sokolov (#4)
Re: Multi-Master Logical Replication

On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:

1.1 ADVANTAGES OF MMLR

- Increases write scalability (e.g., all nodes can write arbitrary data).

I've never heard that transaction-aware multimaster increases
write scalability. Moreover, usually even non-transactional
multimaster doesn't increase write scalability. At best it
doesn't decrease it.

That is because all hosts have to write all changes anyway, while
the side costs increase due to increased network interchange,
interlocking (for transaction-aware MM), and increased latency.

I agree it won't increase in all cases, but it will be better in the
cases where users in different geographical regions operate on
independent schemas in asynchronous mode. Since the write node is
closer to its geographical zone, the performance will be better in
those cases.

On Thu, 28/04/2022 at 08:34 +0000, kuroda.hayato@fujitsu.com wrote:

Dear Laurenz,

Thank you for your interest in our work!

I am missing a discussion of how replication conflicts are handled to
prevent replication from breaking

Actually, we don't have plans to develop a feature that avoids conflicts.
We think that it should be done as a core PUB/SUB feature, and
this module will just use that.

If you really want to have some proper isolation levels (
Read Committed? Repeatable Read?) and/or want to have
same data on each "master", there is no easy way. If you
think it will be "easy", you are already wrong.

The synchronous_commit and synchronous_standby_names configuration
parameters will help in getting the same data across the nodes. Can
you give an example of a scenario where it will be difficult?
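
For reference, the settings referred to above can be sketched as a
minimal SQL fragment (the standby name 'node2' is an illustrative
assumption):

```sql
-- Wait for the named standby to confirm before a commit returns.
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (node2)';
SELECT pg_reload_conf();
```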

Regards,
Vignesh

#6 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: vignesh C (#5)
Re: Multi-Master Logical Replication

On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:

On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:

1.1 ADVANTAGES OF MMLR

- Increases write scalability (e.g., all nodes can write arbitrary data).

I've never heard that transaction-aware multimaster increases
write scalability. Moreover, usually even non-transactional
multimaster doesn't increase write scalability. At best it
doesn't decrease it.

That is because all hosts have to write all changes anyway, while
the side costs increase due to increased network interchange,
interlocking (for transaction-aware MM), and increased latency.

I agree it won't increase in all cases, but it will be better in the
cases where users in different geographical regions operate on
independent schemas in asynchronous mode. Since the write node is
closer to its geographical zone, the performance will be better in
those cases.

From the EnterpriseDB BDR page [1]:

Adding more master nodes to a BDR Group does not result in
significant write throughput increase when most tables are
replicated because BDR has to replay all the writes on all nodes.
Because BDR writes are in general more effective than writes coming
from Postgres clients via SQL, some performance increase can be
achieved. Read throughput generally scales linearly with the number
of nodes.

And I'm sure EnterpriseDB does its best.

On Thu, 28/04/2022 at 08:34 +0000, kuroda.hayato@fujitsu.com wrote:

Dear Laurenz,

Thank you for your interest in our work!

I am missing a discussion of how replication conflicts are handled to
prevent replication from breaking

Actually, we don't have plans to develop a feature that avoids conflicts.
We think that it should be done as a core PUB/SUB feature, and
this module will just use that.

If you really want to have some proper isolation levels (
Read Committed? Repeatable Read?) and/or want to have
same data on each "master", there is no easy way. If you
think it will be "easy", you are already wrong.

The synchronous_commit and synchronous_standby_names configuration
parameters will help in getting the same data across the nodes. Can
you give an example of a scenario where it will be difficult?

So, synchronous or asynchronous?
Synchronous commit on every master, on every alive master, or on a
quorum of masters?

And it is not about synchronicity. It is about determinism at
conflicts.

If you have fully deterministic conflict resolution that works
exactly the same way on each host, then it is possible to have the
same data on each host. (But it will not be transactional.) And it
seems EDB BDR achieved this.

Or if you have fully and correctly implemented one of the distributed
transaction protocols.

[1]: https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance

regards

------

Yura Sokolov

#7 Peter Smith
smithpb2250@gmail.com
In reply to: Yura Sokolov (#6)
Re: Multi-Master Logical Replication

On Fri, Apr 29, 2022 at 2:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

[snip]

So, synchronous or asynchronous?
Synchronous commit on every master, on every alive master, or on a
quorum of masters?

And it is not about synchronicity. It is about determinism at
conflicts.

If you have fully deterministic conflict resolution that works
exactly the same way on each host, then it is possible to have the
same data on each host. (But it will not be transactional.) And it
seems EDB BDR achieved this.

Or if you have fully and correctly implemented one of the distributed
transaction protocols.

[1] https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance

regards

------

Yura Sokolov

Thanks for your feedback.

This MMLR proposal was mostly just to create an interface making it
easier to use PostgreSQL core logical replication CREATE
PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
Otherwise, this is difficult for a user to do manually. (e.g.
difficulties as mentioned in section 2.2 of the original post [1] -
dealing with initial table data, coordinating the timing/locking to
avoid concurrent updates, getting the SUBSCRIPTION options for
copy_data exactly right etc)

At this time we have no provision for HA, nor for transaction
consistency awareness, conflict resolutions, node failure detections,
DDL replication etc. Some of the features like DDL replication are
currently being implemented [2], so when committed it will become
available in the core, and can then be integrated into this module.

Once the base feature of the current MMLR proposal is done, perhaps it
can be extended in subsequent versions.

Probably our calling this “Multi-Master” has been
misleading/confusing, because that term implies much more to other
readers. We really only intended it to mean the ability to set up
logical replication across a set of nodes. Of course, we can rename
the proposal (and API) to something different if there are better
suggestions.

------
[1]: /messages/by-id/CAHut+PuwRAoWY9pz=Eubps3ooQCOBFiYPU9Yi=VB-U+yORU7OA@mail.gmail.com
[2]: /messages/by-id/45d0d97c-3322-4054-b94f-3c08774bbd90@www.fastmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#8 vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#7)
Re: Multi-Master Logical Replication

On Fri, Apr 29, 2022 at 2:35 PM Peter Smith <smithpb2250@gmail.com> wrote:

[snip]

Thanks for your feedback.

This MMLR proposal was mostly just to create an interface making it
easier to use PostgreSQL core logical replication CREATE
PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
Otherwise, this is difficult for a user to do manually. (e.g.
difficulties as mentioned in section 2.2 of the original post [1] -
dealing with initial table data, coordinating the timing/locking to
avoid concurrent updates, getting the SUBSCRIPTION options for
copy_data exactly right etc)

The different problems, and how to solve each scenario, are described in detail in [1].
It gets even more complex when more nodes are involved; let's
consider the 3-node case.
To add a new node node3 to the existing node1 and node2 when data is
already present in node1 and node2, the following steps are
required:
Create a publication in node3:
CREATE PUBLICATION pub_node3 FOR ALL TABLES;

Create a subscription in node1 to subscribe the changes from node3:
CREATE SUBSCRIPTION sub_node1_node3
CONNECTION 'dbname=foo host=node3 user=repuser'
PUBLICATION pub_node3
WITH (copy_data = off, local_only = on);

Create a subscription in node2 to subscribe the changes from node3:
CREATE SUBSCRIPTION sub_node2_node3
CONNECTION 'dbname=foo host=node3 user=repuser'
PUBLICATION pub_node3
WITH (copy_data = off, local_only = on);

Lock the database at node2 and wait till the walsender sends WAL to
node1 (up to the current lsn) to avoid any data loss due to node2's
WAL not yet having been sent to node1. This lock needs to be held till
the setup is complete.

Create a subscription in node3 to subscribe the changes from node1,
here copy_data is specified as force so that the existing table data
is copied during initial sync:
CREATE SUBSCRIPTION sub_node3_node1
CONNECTION 'dbname=foo host=node1 user=repuser'
PUBLICATION pub_node1
WITH (copy_data = force, local_only = on);

Create a subscription in node3 to subscribe the changes from node2:
CREATE SUBSCRIPTION sub_node3_node2
CONNECTION 'dbname=foo host=node2 user=repuser'
PUBLICATION pub_node2
WITH (copy_data = off, local_only = on);

If data is present in node3, a few more steps are required: a) copy
node3's data to node1, b) copy node3's data to node2, c) alter the
publication not to replicate the truncate operation, d) truncate the
data in node3, and e) alter the publication to again include truncate.

[1]: /messages/by-id/CAA4eK1+co2cd8a6okgUD_pcFEHcc7mVc0k_RE2=6ahyv3WPRMg@mail.gmail.com

Regards,
Vignesh

#9 vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#1)
Re: Multi-Master Logical Replication

On Thu, Apr 28, 2022 at 5:20 AM Peter Smith <smithpb2250@gmail.com> wrote:

[snip]

One of my colleagues will post more detailed information later.

MMLR has been changed to LRG (Logical Replication Group) to avoid confusion.

The LRG functionality will be implemented as given below:
The lrg contrib module provides a set of APIs to allow setting up
bi-directional logical replication among different nodes. The name lrg
stands for Logical Replication Group.
To use this functionality, shared_preload_libraries must be set to lrg, like:
shared_preload_libraries = 'lrg'
A new process, the "lrg launcher", is added, which will be launched
when the extension is created. This process is responsible for
checking whether the user has created a new logical replication group,
is attaching a new node to a logical replication group, is detaching a
node, or is dropping a logical replication group; if so, it launches a
new "lrg worker" for the corresponding database.
The new "lrg worker" process is responsible for handling the core
tasks of the lrg_create, lrg_node_attach, lrg_node_detach and lrg_drop
functionality.
The "lrg worker" is required here because there are a lot of steps
involved in this process, like create publication, create subscription,
alter publication, lock table, etc. If there is a failure during any
part of the process, the worker will be restarted and is responsible
for continuing the operation from where it left off to completion.
The following new tables were added to maintain the logical
replication group related information:
-- pg_lrg_info table to maintain the logical replication group information.
CREATE TABLE lrg.pg_lrg_info
(
groupname text PRIMARY KEY, -- name of the logical replication group
pubtype text -- type of publication (ALL TABLES, SCHEMA, TABLE);
             -- currently only "ALL TABLES" is supported
);

-- pg_lrg_nodes table to maintain information about the nodes that are
members of the logical replication group.
CREATE TABLE lrg.pg_lrg_nodes
(
nodeid text PRIMARY KEY, -- node id (actual node_id format is
                         -- still not finalized)
groupname text REFERENCES pg_lrg_info(groupname), -- name of the
                         -- logical replication group
dbid oid NOT NULL, -- db id
status text NOT NULL, -- status of the node
nodename text, -- node name
localconn text NOT NULL, -- local connection string
upstreamconn text -- upstream connection string to connect to
                  -- another node already in the logical replication group
);

-- pg_lrg_pub table to maintain the publications that were created
for this node.
CREATE TABLE lrg.pg_lrg_pub
(
groupname text REFERENCES pg_lrg_info(groupname), -- name of the
                         -- logical replication group
pubid oid NOT NULL -- oid of the publication
);

-- pg_lrg_sub table to maintain the subscriptions that were created
for this node.
CREATE TABLE lrg.pg_lrg_sub
(
groupname text REFERENCES pg_lrg_info(groupname), -- name of the
                         -- logical replication group
subid oid NOT NULL -- oid of the subscription
);

The following functionality was added to support the various logical
replication group functionalities:
lrg_create(group_name text, pub_type text, local_connection_string text, node_name text)
lrg_node_attach(group_name text, local_connection_string text, upstream_connection_string text, node_name text)
lrg_node_detach(group_name text, node_name text)
lrg_drop(group_name text)
-----------------------------------------------------------------------------------------------------------------------------------

lrg_create – This function creates a logical replication group as
specified in group_name.
example:
postgres=# SELECT lrg.lrg_create('test', 'FOR ALL TABLES',
'user=postgres port=5432', 'testnode1');

This function adds a logical replication group "test" with pubtype
"FOR ALL TABLES" to pg_lrg_info as given below:
postgres=# select * from lrg.pg_lrg_info;
 groupname |    pubtype
-----------+----------------
 test      | FOR ALL TABLES
(1 row)

It adds node information which includes the node id, database id,
status, node name, connection string and upstream connection string to
pg_lrg_nodes like given below:
postgres=# select * from lrg.pg_lrg_nodes;
                    nodeid                    | groupname | dbid | status | nodename  |        localconn        | upstreamconn
----------------------------------------------+-----------+------+--------+-----------+-------------------------+--------------
 70934590432710321605user=postgres port=5432  | test      |    5 | ready  | testnode1 | user=postgres port=5432 |
(1 row)

The "lrg worker" will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) It will create the publication in the current node.
3) It will change the (pg_lrg_nodes) status from init to createpublication.
4) It will unlock the pg_lrg_info and pg_lrg_nodes tables.
5) It will change the (pg_lrg_nodes) status from createpublication to ready.
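
The steps above could be sketched as the SQL a worker might issue (a
rough sketch only; the actual worker performs these internally in C,
and names such as lrg_test_pub and the status-update statements are
illustrative assumptions):

```sql
BEGIN;
-- Step 1: lock the group catalogs against concurrent LRG operations.
LOCK TABLE lrg.pg_lrg_info, lrg.pg_lrg_nodes IN ACCESS EXCLUSIVE MODE;
-- Step 2: create the group's publication on this node.
CREATE PUBLICATION lrg_test_pub FOR ALL TABLES;
-- Step 3: record the progress so a restarted worker can resume here.
UPDATE lrg.pg_lrg_nodes SET status = 'createpublication'
 WHERE nodename = 'testnode1';
COMMIT;  -- step 4: the locks are released at commit
-- Step 5: mark the node ready.
UPDATE lrg.pg_lrg_nodes SET status = 'ready'
 WHERE nodename = 'testnode1';
```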
-----------------------------------------------------------------------------------------------------------------------------------

lrg_node_attach – Attach the specified node to the specified logical
replication group.
example:
postgres=# SELECT lrg.lrg_node_attach('test', 'user=postgres
port=9999', 'user=postgres port=5432', 'testnode2');
This function adds the logical replication group "test" with pubtype
"FOR ALL TABLES" to pg_lrg_info in the new node, as shown below:
postgres=# select * from lrg.pg_lrg_info;
 groupname |    pubtype
-----------+----------------
 test      | FOR ALL TABLES
(1 row)

This is the same group name that was added by lrg_create on the node
where the group was created. Now this information is available in the
new node too, which allows the user to attach via any of the nodes
present in the logical replication group.
It adds node information which includes the node id, database id,
status, node name, connection string and upstream connection string of
the current node and the other nodes that are part of the logical
replication group to pg_lrg_nodes like given below:
postgres=# select * from lrg.pg_lrg_nodes;
                           nodeid                           | groupname | dbid | status | nodename  |                localconn                |             upstreamconn
------------------------------------------------------------+-----------+------+--------+-----------+-----------------------------------------+----------------------------------------
 70937999584732760095user=vignesh dbname=postgres port=9999 | test      |    5 | ready  | testnode2 | user=vignesh dbname=postgres port=9999  | user=vignesh dbname=postgres port=5432
 70937999523629205245user=vignesh dbname=postgres port=5432 | test      |    5 | ready  | testnode1 | user=vignesh dbname=postgres port=5432  |
(2 rows)

It will use the upstream connection to connect to the upstream node
and get the nodes that are part of the logical replication group.
Note: The nodeid shown here is for illustrative purposes; the actual
nodeid format is still not finalized.
For this API the "lrg worker" will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) It will connect to the upstream node specified and get the list of
other nodes present in the logical replication group.
3) It will connect to the remaining nodes and lock the database so
that no new operations are performed.
4) It will wait until the upstream node reaches the latest lsn of the
remaining nodes; this is somewhat similar to the wait_for_catchup
function in the TAP tests.
5) It will change the status (pg_lrg_nodes) from init to waitforlsncatchup.
6) It will create the publication in the current node.
7) It will change the status (pg_lrg_nodes) from waitforlsncatchup to
createpublication.
8) It will create a subscription in all the remaining nodes to get the
data from new node.
9) It will change the status (pg_lrg_nodes) from createpublication to
createsubscription.
10) It will alter the publication not to replicate truncate operation.
11) It will truncate the table.
12) It will alter the publication to include sending the truncate operation.
13) It will create a subscription in the current node to subscribe to
the data with copy_data = force.
14) It will create a subscription in the remaining nodes to subscribe
to the data with copy_data = off.
15) It will unlock the database in all the remaining nodes.
16) It will unlock the pg_lrg_info and pg_lrg_nodes tables.
17) It will change the status (pg_lrg_nodes) from createsubscription to ready.
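
Steps 10-14 on the joining node correspond roughly to SQL like the
following (a sketch; the publication/subscription names and table t1
are illustrative, and copy_data = force is the option added by the
patch in Vignesh's thread):

```sql
-- Step 10: stop publishing truncates so the local truncate is not replicated.
ALTER PUBLICATION lrg_test_pub SET (publish = 'insert, update, delete');
-- Step 11: clear the local pre-existing data.
TRUNCATE t1;
-- Step 12: re-enable truncate replication.
ALTER PUBLICATION lrg_test_pub SET (publish = 'insert, update, delete, truncate');
-- Step 13: subscribe to the upstream node, copying its full data set.
CREATE SUBSCRIPTION lrg_test_sub
  CONNECTION 'user=postgres port=5432'
  PUBLICATION lrg_test_pub_upstream
  WITH (copy_data = force);
```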

The status will be useful for displaying the progress of the operation
to the user, and it helps failure handling to continue the operation
from the state at which it failed.
-----------------------------------------------------------------------------------------------------------------------------------

lrg_node_detach – detach a node from the logical replication group.
example:
postgres=# SELECT lrg.lrg_node_detach('test', 'testnode');
For this API the "lrg worker" will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) It will get the list of other nodes present in the logical replication group.
3) It will connect to the remaining nodes and lock the database so
that no new operations are performed.
4) It will drop the subscription in all the nodes corresponding to
this node of the cluster.
5) It will drop the publication in the current node.
6) It will remove all the data associated with this logical
replication group from pg_lrg_* tables.
7) It will unlock the pg_lrg_info and pg_lrg_nodes tables.
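
Steps 4-6 amount to something like the following (a sketch; the object
names are illustrative assumptions):

```sql
-- Step 4 (on each remaining node): drop the subscription that pulls
-- data from the detached node.
DROP SUBSCRIPTION lrg_test_sub_testnode;
-- Step 5 (on the detached node): drop its publication for the group.
DROP PUBLICATION lrg_test_pub;
-- Step 6: remove this group's metadata for the node.
DELETE FROM lrg.pg_lrg_nodes WHERE nodename = 'testnode';
```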
-----------------------------------------------------------------------------------------------------------------------------------

lrg_drop - drop a group from logical replication groups.
example:
postgres=# SELECT lrg.lrg_drop('test');

This function removes the specified group from the logical replication
groups. This function must be executed on a member node of the given
logical replication group.
For this API the "lrg worker" will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) DROP PUBLICATION of this node that was created for this logical
replication group.
3) Remove all data from the logical replication group system table
associated with the logical replication group.
4) It will unlock the pg_lrg_info and pg_lrg_nodes tables.

If there are no objections the API can be implemented as SQL functions
in PostgreSQL core and the new tables can be created as system tables.

Thoughts?

Regards,
Vignesh

#10Bruce Momjian
bruce@momjian.us
In reply to: Peter Smith (#7)
Re: Multi-Master Logical Replication

On Fri, Apr 29, 2022 at 07:05:11PM +1000, Peter Smith wrote:

This MMLR proposal was mostly just to create an interface making it
easier to use PostgreSQL core logical replication CREATE
PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
Otherwise, this is difficult for a user to do manually. (e.g.
difficulties as mentioned in section 2.2 of the original post [1] -
dealing with initial table data, coordinating the timing/locking to
avoid concurrent updates, getting the SUBSCRIPTION options for
copy_data exactly right etc)

At this time we have no provision for HA, nor for transaction
consistency awareness, conflict resolutions, node failure detections,
DDL replication etc. Some of the features like DDL replication are
currently being implemented [2], so when committed it will become
available in the core, and can then be integrated into this module.

Uh, without these features, what workload would this help with? I think
you made the mistake of jumping too far into implementation without
explaining the problem you are trying to solve. The TODO list has this
ordering:

https://wiki.postgresql.org/wiki/Todo
Desirability -> Design -> Implement -> Test -> Review -> Commit

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#10)
Re: Multi-Master Logical Replication

On Sat, May 14, 2022 at 12:33 AM Bruce Momjian <bruce@momjian.us> wrote:

Uh, without these features, what workload would this help with?

To allow replication among multiple nodes when some of the nodes may
have pre-existing data. This work plans to provide simple APIs to
achieve that. Now, let me try to explain the difficulties users can
face with the existing interface. It is simple to set up replication
among various nodes when they don't have any pre-existing data but
even in that case if the user operates on the same table at multiple
nodes, the replication will lead to an infinite loop and won't
proceed. The example in email [1] demonstrates that and the patch in
that thread attempts to solve it. I have mentioned that problem
because this work will need that patch.

Now, let's take a simple case where two nodes have the same table
which has some pre-existing data:

Node-1:
Table t1 (c1 int) has data
1, 2, 3, 4

Node-2:
Table t1 (c1 int) has data
5, 6, 7, 8

If we have to set up replication among the above two nodes using
existing interfaces, it could be very tricky. Say user performs
operations like below:

Node-1
#Publication for t1
Create Publication pub1 For Table t1;

Node-2
#Publication for t1,
Create Publication pub1_2 For Table t1;

Node-1:
Create Subscription sub1 Connection '<node-2 details>' Publication pub1_2;

Node-2:
Create Subscription sub1_2 Connection '<node-1 details>' Publication pub1;

After this the data will be something like this:
Node-1:
1, 2, 3, 4, 5, 6, 7, 8

Node-2:
1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8

So, you can see that data on Node-2 (5, 6, 7, 8) is duplicated. In
case, table t1 has a unique key, it will lead to a unique key
violation and replication won't proceed. Here, I have assumed that we
already have functionality for the patch in email [1], otherwise,
replication will be an infinite loop replicating the above data again
and again. Now one way to achieve this could be that we can ask users
to stop all operations on both nodes before starting replication
between those and take data dumps of tables from each node they want
to replicate and restore them to other nodes. Then use the above
commands to set up replication and allow to start operations on those
nodes. The other possibility for users could be as below. Assume, we
have already created publications as in the above example, and then:

Node-2:
Create Subscription sub1_2 Connection '<node-1 details>' Publication pub1;

#Wait for the initial sync of table t1 to finish. Users can ensure
that by checking 'srsubstate' in pg_subscription_rel.
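
For example, the check could look like this (pg_subscription_rel is an
existing system catalog; the state codes follow the PostgreSQL
documentation):

```sql
-- Run on the subscriber; wait until every relation reports 'r' (ready).
SELECT srrelid::regclass AS relation, srsubstate
  FROM pg_subscription_rel;
-- srsubstate: 'i' = initialize, 'd' = data is being copied,
--             's' = synchronized, 'r' = ready (normal replication)
```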

Node-1:
Begin;
# Disallow truncates to be published and then truncate the table
Alter Publication pub1 Set (publish = 'insert, update, delete');
Truncate t1;
Create Subscription sub1 Connection '<node-2 details>' Publication pub1_2;
Alter Publication pub1 Set (publish = 'insert, update, delete, truncate');
Commit;

This will become more complicated when more than two nodes are
involved, see the example provided for the three nodes case [2]. Can
you think of some other simpler way to achieve the same? If not, I
don't think the current way is ideal and even users won't prefer that.
I am not telling that the APIs proposed in this thread is the only or
best way to achieve the desired purpose but I think we should do
something to allow users to easily set up replication among multiple
nodes.

[1]: /messages/by-id/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9=9orqubhjcQ@mail.gmail.com
[2]: /messages/by-id/CALDaNm3aD3nZ0HWXA8V435AGMvORyR5-mq2FzqQdKQ8CPomB5Q@mail.gmail.com

--
With Regards,
Amit Kapila.

#12Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#11)
RE: Multi-Master Logical Replication

Hi hackers,

I created a small PoC. Please see the attached patches.

REQUIREMENT

Before applying these patches, the patches in [1] must also be applied.

DIFFERENCES FROM PREVIOUS DESCRIPTIONS

* LRG is now implemented as SQL functions, not as a contrib module.
* New tables are added as system catalogs. Therefore, added tables have oid column.
* The node_id is the concatenation of the system identifier and the dbid.

HOW TO USE

In the document patch, a subsection 'Example' was added for understanding LRG. In short, we can do

1. lrg_create on one node
2. lrg_node_attach on another node
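
Using the function signatures from earlier in this thread, the two
calls would look like this (the connection details are illustrative):

```sql
-- On node 1 (port 5432): create the group.
SELECT lrg.lrg_create('test', 'FOR ALL TABLES',
                      'user=postgres port=5432', 'testnode1');
-- On node 2 (port 9999): attach, using node 1 as the upstream.
SELECT lrg.lrg_node_attach('test', 'user=postgres port=9999',
                           'user=postgres port=5432', 'testnode2');
```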

Also attached is a test script that constructs a three-node system.

LIMITATIONS

This feature is under development, so there are still many limitations on its use.

* The function for detaching a node from a group is not implemented.
* The function for removing a group is not implemented.
* LRG does not lock system catalogs and databases. Concurrent operations may cause inconsistent state.
* LRG does not wait until the upstream node reaches the latest lsn of the remaining nodes.
* LRG does not support initial data sync. That is, it can work well only when all nodes do not have initial data.

[1]: https://commitfest.postgresql.org/38/3610/

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

test.sh
v1-0001-PoC-implement-LRG.patch
v1-0002-add-doc.patch
#13Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#12)
RE: Multi-Master Logical Replication

Hi hackers,

The patch in [1] has changed the name of the parameter, so I rebased
the patch. Furthermore, I implemented the first version of the
lrg_node_detach and lrg_drop functions, and fixed some code comments.

0001 and 0002 were copied from [1]; they were attached for the cfbot.
Please see 0003 and 0004 for LRG related codes.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0003-PoC-implement-LRG.patch
v2-0004-add-doc.patch
v14-0001-Skip-replication-of-non-local-data.patch
v14-0002-Support-force-option-for-copy_data-check-and-thr.patch
#14Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#13)
RE: Multi-Master Logical Replication

Sorry, I forgot to attach the test script.
For cfbot I attached again all files. Sorry for the noise.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0004-add-doc.patch
v14-0001-Skip-replication-of-non-local-data.patch
v14-0002-Support-force-option-for-copy_data-check-and-thr.patch
test.sh
v2-0003-PoC-implement-LRG.patch
#15Bruce Momjian
bruce@momjian.us
In reply to: Amit Kapila (#11)
Re: Multi-Master Logical Replication

On Sat, May 14, 2022 at 12:20:05PM +0530, Amit Kapila wrote:

On Sat, May 14, 2022 at 12:33 AM Bruce Momjian <bruce@momjian.us> wrote:

Uh, without these features, what workload would this help with?

To allow replication among multiple nodes when some of the nodes may
have pre-existing data. This work plans to provide simple APIs to
achieve that. Now, let me try to explain the difficulties users can
face with the existing interface. It is simple to set up replication
among various nodes when they don't have any pre-existing data but
even in that case if the user operates on the same table at multiple
nodes, the replication will lead to an infinite loop and won't
proceed. The example in email [1] demonstrates that and the patch in
that thread attempts to solve it. I have mentioned that problem
because this work will need that patch.

...

This will become more complicated when more than two nodes are
involved, see the example provided for the three nodes case [2]. Can
you think of some other simpler way to achieve the same? If not, I
don't think the current way is ideal and even users won't prefer that.
I am not telling that the APIs proposed in this thread is the only or
best way to achieve the desired purpose but I think we should do
something to allow users to easily set up replication among multiple
nodes.

You still have not answered my question above. "Without these features,
what workload would this help with?" You have only explained how the
patch would fix one of the many larger problems.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#15)
Re: Multi-Master Logical Replication

On Tue, May 24, 2022 at 5:57 PM Bruce Momjian <bruce@momjian.us> wrote:

On Sat, May 14, 2022 at 12:20:05PM +0530, Amit Kapila wrote:

On Sat, May 14, 2022 at 12:33 AM Bruce Momjian <bruce@momjian.us> wrote:

Uh, without these features, what workload would this help with?

To allow replication among multiple nodes when some of the nodes may
have pre-existing data. This work plans to provide simple APIs to
achieve that. Now, let me try to explain the difficulties users can
face with the existing interface. It is simple to set up replication
among various nodes when they don't have any pre-existing data but
even in that case if the user operates on the same table at multiple
nodes, the replication will lead to an infinite loop and won't
proceed. The example in email [1] demonstrates that and the patch in
that thread attempts to solve it. I have mentioned that problem
because this work will need that patch.

...

This will become more complicated when more than two nodes are
involved, see the example provided for the three nodes case [2]. Can
you think of some other simpler way to achieve the same? If not, I
don't think the current way is ideal and even users won't prefer that.
I am not telling that the APIs proposed in this thread is the only or
best way to achieve the desired purpose but I think we should do
something to allow users to easily set up replication among multiple
nodes.

You still have not answered my question above. "Without these features,
what workload would this help with?" You have only explained how the
patch would fix one of the many larger problems.

It helps with setting up logical replication among two or more nodes
(data flows both ways) which is important for use cases where
applications are data-aware. For such apps, it will be beneficial to
always send and retrieve data to local nodes in a geographically
distributed database. Now, for such apps, to get 100% consistent data
among nodes, one needs to enable synchronous_mode (aka set
synchronous_standby_names) but if that hurts performance and the data
is for analytical purposes then one can use it in asynchronous mode.
Now, for such cases, if the local node goes down, the other master
node can be immediately available to use, sure it may slow down the
operations for some time till the local node come-up. For such apps,
later it will be also easier to perform online upgrades.

Without this, if the user tries to achieve the same via physical
replication by having two local nodes, it can take quite long before
the standby can be promoted to master and local reads/writes will be
much costlier.

--
With Regards,
Amit Kapila.

#17Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#16)
Re: Multi-Master Logical Replication

On Wed, May 25, 2022 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 24, 2022 at 5:57 PM Bruce Momjian <bruce@momjian.us> wrote:

On Sat, May 14, 2022 at 12:20:05PM +0530, Amit Kapila wrote:

On Sat, May 14, 2022 at 12:33 AM Bruce Momjian <bruce@momjian.us> wrote:

Uh, without these features, what workload would this help with?

To allow replication among multiple nodes when some of the nodes may
have pre-existing data. This work plans to provide simple APIs to
achieve that. Now, let me try to explain the difficulties users can
face with the existing interface. It is simple to set up replication
among various nodes when they don't have any pre-existing data but
even in that case if the user operates on the same table at multiple
nodes, the replication will lead to an infinite loop and won't
proceed. The example in email [1] demonstrates that and the patch in
that thread attempts to solve it. I have mentioned that problem
because this work will need that patch.

...

This will become more complicated when more than two nodes are
involved, see the example provided for the three nodes case [2]. Can
you think of some other simpler way to achieve the same? If not, I
don't think the current way is ideal and even users won't prefer that.
I am not telling that the APIs proposed in this thread is the only or
best way to achieve the desired purpose but I think we should do
something to allow users to easily set up replication among multiple
nodes.

You still have not answered my question above. "Without these features,
what workload would this help with?" You have only explained how the
patch would fix one of the many larger problems.

It helps with setting up logical replication among two or more nodes
(data flows both ways) which is important for use cases where
applications are data-aware. For such apps, it will be beneficial to
always send and retrieve data to local nodes in a geographically
distributed database. Now, for such apps, to get 100% consistent data
among nodes, one needs to enable synchronous_mode (aka set
synchronous_standby_names) but if that hurts performance and the data
is for analytical purposes then one can use it in asynchronous mode.
Now, for such cases, if the local node goes down, the other master
node can be immediately available to use, sure it may slow down the
operations for some time till the local node come-up. For such apps,
later it will be also easier to perform online upgrades.

Without this, if the user tries to achieve the same via physical
replication by having two local nodes, it can take quite long before
the standby can be promoted to master and local reads/writes will be
much costlier.

As mentioned above, the LRG idea might be a useful addition to logical
replication for configuring certain types of "data-aware"
applications.

LRG for data-aware apps (e.g. sensor data)
------------------------------------------
Consider an example where there are multiple weather stations for a
country. Each weather station is associated with a PostgreSQL node and
inserts the local sensor data (e.g. wind/rain/sunshine) once a
minute into some local table. The row data is identified by some station
ID.

- Perhaps there are many nodes.

- Loss of a single row of replicated sensor data if some node goes
down is not a major problem for this sort of application.

- Benefits of processing data locally can be realised.

- Using LRG simplifies the setup/sharing of the data across all group
nodes via a common table.

~~

LRG makes setup easier
----------------------
Although it is possible already (using Vignesh's "infinite recursion"
WIP patch [1]) to set up this kind of environment using logical
replication, as the number of nodes grows it becomes more and more
difficult to do it. For each new node, there needs to be N-1 x CREATE
SUBSCRIPTION for the other group nodes, meaning the connection details
for every other node also must be known up-front for the script.

OTOH, the LRG API can simplify all this, removing the user's burden
and risk of mistakes. Also, LRG only needs to know how to reach just 1
other node in the group (the implementation will discover all the
other node connection details internally).

~~

LRG can handle initial table data
--------------------------------
If the joining node (e.g. a new weather station) already has some
initial local sensor data then sharing that initial data manually with
all the other nodes requires some tricky steps. LRG can hide all this
complexity behind the API, so it is not a user problem anymore.

------
[1]: /messages/by-id/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9=9orqubhjcQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#18Bruce Momjian
bruce@momjian.us
In reply to: Amit Kapila (#16)
Re: Multi-Master Logical Replication

On Wed, May 25, 2022 at 12:13:17PM +0530, Amit Kapila wrote:

You still have not answered my question above. "Without these features,
what workload would this help with?" You have only explained how the
patch would fix one of the many larger problems.

It helps with setting up logical replication among two or more nodes
(data flows both ways) which is important for use cases where
applications are data-aware. For such apps, it will be beneficial to

That does make sense, thanks.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#19Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Bruce Momjian (#18)
RE: Multi-Master Logical Replication

Dear hackers,

I added documentation more and tap-tests about LRG.
Same as the previous e-mail, 0001 and 0002 are copied from [1].

The following lists are the TODO items for the patches; they will be solved one by one.

## Functional

* implement a new state "waitforlsncatchup",
that waits until the upstream node receives the latest lsn of the remaining nodes
* implement an over-node locking mechanism
* implement operations that share the initial data
* implement mechanisms to avoid concurrent API execution

Note that tap-tests must also be added if the above are added.

## Implementation

* consider failure handling while executing the APIs
* add error codes for LRG
* change elog() to ereport() for native language support
* define pg_lrg_nodes, which has NULL-able attributes, in the proper style

[1]: https://commitfest.postgresql.org/38/3610/

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v3-0003-PoC-implement-LRG.patch
v3-0004-add-tap-tests.patch
v3-0005-add-documents.patch
v16-0001-Skip-replication-of-non-local-data.patch
v16-0002-Support-force-option-for-copy_data-check-and-thr.patch
#20Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#18)
Re: Multi-Master Logical Replication

On Wed, May 25, 2022 at 10:32:50PM -0400, Bruce Momjian wrote:

On Wed, May 25, 2022 at 12:13:17PM +0530, Amit Kapila wrote:

You still have not answered my question above. "Without these features,
what workload would this help with?" You have only explained how the
patch would fix one of the many larger problems.

It helps with setting up logical replication among two or more nodes
(data flows both ways) which is important for use cases where
applications are data-aware. For such apps, it will be beneficial to

That does make sense, thanks.

Uh, thinking some more, why would anyone set things up this way ---
having part of a table being primary on one server and a different part
of the table be a subscriber. Seems it would be simpler and safer to
create two child tables and have one be primary on only one server.
Users can access both tables using the parent.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#20)
#22Bruce Momjian
bruce@momjian.us
In reply to: Amit Kapila (#21)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#22)
#24Peter Smith
smithpb2250@gmail.com
In reply to: Bruce Momjian (#22)
#25Bruce Momjian
bruce@momjian.us
In reply to: Peter Smith (#24)
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#25)
#27Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Peter Smith (#17)
#28Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Peter Smith (#1)
#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Bharath Rupireddy (#28)
#30Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amit Kapila (#29)
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Bharath Rupireddy (#30)
#32r.takahashi_2@fujitsu.com
r.takahashi_2@fujitsu.com
In reply to: Amit Kapila (#31)
#33Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: r.takahashi_2@fujitsu.com (#32)
#34r.takahashi_2@fujitsu.com
r.takahashi_2@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#33)
#35Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: r.takahashi_2@fujitsu.com (#34)