Logical replication and multimaster
Hello all,
We have implemented ACID multimaster based on logical replication and
our DTM (distributed transaction manager) plugin.
The good news is that it works and no inconsistencies are detected.
But unfortunately it is very, very slow...
On standalone PostgreSQL I am able to achieve about 30000 TPS with 10
clients performing simple debit-credit transactions.
And with a multimaster consisting of three nodes spawned on the same
system I got about 100 (one hundred) TPS.
There are two main reasons for such awful performance:
1. Logical replication serializes all transactions: there is a single
connection between the wal-sender and the receiver BGW.
2. 2PC synchronizes transaction commit across all nodes.
Neither of these two reasons is a show-stopper by itself.
If we remove the DTM and do asynchronous logical replication, then the
performance of multimaster increases to 6000 TPS
(please notice that in this test all multimaster nodes are spawned on the
same system, sharing its resources,
so 6k is not a bad result compared with 30k on a standalone system).
And according to 2ndQuadrant's results, BDR performance is very close to
hot standby.
On the other hand, our previous experiments with the DTM showed only about
a 2x slowdown compared with vanilla PostgreSQL.
But the result of combining the DTM and logical replication is frustrating.
I wonder whether it is a principal limitation of the logical replication
approach, which is efficient only for asynchronous replication, or whether
it can be somehow tuned/extended to efficiently support synchronous
replication?
We have also considered alternative approaches:
1. Statement based replication.
2. Trigger-based replication.
3. Replication using custom nodes.
In the case of statement-based replication it is hard to guarantee
identical data at different nodes.
Approaches 2 and 3 are much harder to implement and require "reinventing"
a substantial part of logical replication.
They also require some kind of connection pool which can be used to send
replicated transactions to the peer nodes (to avoid serialization of
parallel transactions, as in the case of logical replication).
But it looks like there is not much sense in having multiple network
connections between one pair of nodes.
It seems better to have one connection between nodes, but to provide
parallel execution of received transactions on the destination side. But
that also seems nontrivial. We now have some infrastructure in PostgreSQL
for background workers, but there is still no abstraction of a worker pool
and job queue which could provide a simple way to organize parallel
execution of jobs. I wonder if somebody is working on this now, or should
we try to propose our solution?
Best regards,
Konstantin
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Nov 30, 2015 at 11:20 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
We have implemented ACID multimaster based on logical replication and our
DTM (distributed transaction manager) plugin.
Good news is that it works and no inconsistency is detected.
But unfortunately it is very very slow...
At standalone PostgreSQL I am able to achieve about 30000 TPS with 10
clients performing simple debit-credit transactions.
And with multimaster consisting of three nodes spawned at the same system I
got about 100 (one hundred) TPS.
There are two main reasons of such awful performance:
1. Logical replication serializes all transactions: there is single
connection between wal-sender and receiver BGW.
2. 2PC synchronizes transaction commit at all nodes.
None of these two reasons are show stoppers themselves.
If we remove DTM and do asynchronous logical replication then performance of
multimaster is increased to 6000 TPS
(please notice that in this test all multimaster node are spawned at the
same system, sharing its resources,
so 6k is not bad result comparing with 30k at standalone system).
And according to 2ndquadrant results, BDR performance is very close to hot
standby.
Logical decoding only begins decoding a transaction once the
transaction is complete. So I would guess that the sequence of
operations here is something like this - correct me if I'm wrong:
1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
4. PREPARE the replay.
5. COMMIT PREPARED on original machine.
6. COMMIT PREPARED on replica.
Step 3 introduces latency proportional to the amount of work the
transaction did, which could be a lot. If you were doing synchronous
physical replication, the replay of the COMMIT record would only need
to wait for the replay of the commit record itself. But with
synchronous logical replication, you've got to wait for the replay of
the entire transaction. That's a major bummer, especially if replay
is single-threaded and there are a large number of backends generating
transactions. Of course, the 2PC dance itself can also add latency -
that's most likely to be the issue if the transactions are each very
short.
What I'd suggest is trying to measure where the latency is coming
from. You should be able to measure how much time each transaction
spends (a) executing, (b) preparing itself, (c) waiting for the replay
thread to begin replaying it, (d) waiting for the replay thread to
finish replaying it, and (e) committing. Separating (c) and (d) might
be a little bit tricky, but I bet it's worth putting some effort in,
because the answer is probably important to understanding what sort of
change will help here. If (c) is the problem, you might be able to
get around it by having multiple processes, though that only helps if
applying is slower than decoding. But if (d) is the problem, then the
only solution is probably to begin applying the transaction
speculatively before it's prepared/committed. I think.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for reply.
On 12/02/2015 08:30 PM, Robert Haas wrote:
Logical decoding only begins decoding a transaction once the
transaction is complete. So I would guess that the sequence of
operations here is something like this - correct me if I'm wrong:
1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
4. PREPARE the replay.
5. COMMIT PREPARED on original machine.
6. COMMIT PREPARED on replica.
Logical decoding is started after execution of the XLogFlush method.
So actually the transaction is not yet completed at this moment:
- it is not marked as committed in the clog
- it is marked as in-progress in the procarray
- locks are not released
We are not using PostgreSQL two-phase commit here.
Instead, our DTM catches control in TransactionIdCommitTree and sends a request to the arbiter, which in turn waits for the status of committing transactions on the replicas.
The problem is that transactions are delivered to the replica through a single channel: the logical replication slot.
And while such a transaction is waiting for acknowledgement from the arbiter, it is blocking the replication channel, preventing other (parallel) transactions from being replicated and applied.
I have implemented a pool of background workers. Maybe it will be useful not only for me.
It consists of a single-producer multiple-consumers queue implemented using a buffer in shared memory, a spinlock and two semaphores.
The API is very simple:
typedef void(*BgwPoolExecutor)(int id, void* work, size_t size);
typedef BgwPool*(*BgwPoolConstructor)(void);
extern void BgwPoolStart(int nWorkers, BgwPoolConstructor constructor);
extern void BgwPoolInit(BgwPool* pool, BgwPoolExecutor executor, char const* dbname, size_t queueSize);
extern void BgwPoolExecute(BgwPool* pool, void* work, size_t size);
You just place some bulk of bytes (work, size) in this queue, and the first available worker will dequeue and execute it.
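The scheme described above (shared buffer, spinlock, two counting semaphores) can be sketched in user space roughly as follows. Here pthreads and POSIX semaphores stand in for background workers and PostgreSQL primitives, and all names are illustrative rather than the actual BgwPool code:

```c
/* Illustrative single-producer/multiple-consumers ring buffer guarded by
 * a lock, with two semaphores counting free and filled slots.  In the
 * real pool the buffer lives in shared memory and the consumers are
 * background workers; this is a user-space analogy only. */
#include <pthread.h>
#include <semaphore.h>
#include <string.h>

#define QUEUE_SIZE 8
#define JOB_SIZE   16           /* max bytes per job in this sketch */

typedef struct
{
    char            buf[QUEUE_SIZE][JOB_SIZE];
    int             head;
    int             tail;
    pthread_mutex_t lock;       /* stands in for the spinlock       */
    sem_t           free_slots; /* producer blocks when queue full  */
    sem_t           used_slots; /* consumers block when queue empty */
} JobQueue;

static void
queue_init(JobQueue *q)
{
    q->head = q->tail = 0;
    pthread_mutex_init(&q->lock, NULL);
    sem_init(&q->free_slots, 0, QUEUE_SIZE);
    sem_init(&q->used_slots, 0, 0);
}

/* Analogous to BgwPoolExecute: copy (work, size) into the queue.
 * Assumes size <= JOB_SIZE. */
static void
queue_put(JobQueue *q, const void *work, size_t size)
{
    sem_wait(&q->free_slots);
    pthread_mutex_lock(&q->lock);
    memcpy(q->buf[q->tail], work, size);
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->used_slots);
}

/* The first available worker dequeues a job and executes it. */
static void
queue_get(JobQueue *q, void *work, size_t size)
{
    sem_wait(&q->used_slots);
    pthread_mutex_lock(&q->lock);
    memcpy(work, q->buf[q->head], size);
    q->head = (q->head + 1) % QUEUE_SIZE;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->free_slots);
}
```

Each worker would loop on queue_get() and pass the bytes to the registered BgwPoolExecutor callback.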
Using this pool and a larger number of accounts (reducing the probability of conflicts), I get better results.
So now the receiver of logical replication is not executing transactions directly; instead, the receiver places them in the queue and they are executed concurrently by the pool of background workers.
At a cluster with three nodes, the results of our debit-credit benchmark are the following:

                                      TPS
  Multimaster (ACID transactions)   12500
  Multimaster (async replication)   34800
  Standalone PostgreSQL             44000
We tested two modes: when the client randomly distributes queries between cluster nodes, and when the client works only with one master node and the others are just used as replicas. Performance is slightly better in the second case, but the difference is not very
large (about 11000 TPS in the first case).
The number of workers in the pool has a significant impact on performance: with 8 workers we get about 7800 TPS, and with 16 workers 12500.
Performance also greatly depends on the number of accounts (and thus the probability of lock conflicts). In the case of 100 accounts, the speed is less than 1000 TPS.
On 1 December 2015 at 00:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
wrote:
We have implemented ACID multimaster based on logical replication and our
DTM (distributed transaction manager) plugin.
What are you using for an output plugin and for replay?
I'd really like to collaborate using pglogical_output if at all possible.
Petr's working really hard to get the pglogical downstream out too, with me
helping where I can.
I'd hate to be wasting time and effort working in parallel on overlapping
functionality. I did a LOT of work to make pglogical_output extensible and
reusable for different needs, with hooks used heavily instead of making
things specific to the pglogical downstream. A protocol documented in
detail. A json output mode as an option. Parameters for clients to
negotiate options. etc.
Would a different name for the upstream output plugin help?
And according to 2ndquadrant results, BDR performance is very close to hot
standby.
Yes... but it's asynchronous multi-master. Very different to what you're
doing.
I wonder if it is principle limitation of logical replication approach
which is efficient only for asynchronous replication or it can be somehow
tuned/extended to efficiently support synchronous replication?
I'm certain there are improvements to be made for synchronous replication.
We have also considered alternative approaches:
1. Statement based replication.
Just don't go there. Really.
It seems to be better to have one connection between nodes, but provide
parallel execution of received transactions at destination side.
I agree. This is something I'd like to be able to do through logical
decoding. As far as I can tell there's no fundamental barrier to doing so,
though there are a few limitations when streaming logical xacts:
- We can't avoid sending transactions that get rolled back
- We can't send the commit timestamp, commit LSN, etc at BEGIN time, so
last-update-wins
conflict resolution can't be done based on commit timestamp
- When streaming, the xid must be in each message, not just in begin/commit.
- The apply process can't use the SPI to apply changes directly since we
can't multiplex transactions. It'll need to use
shmem to communicate with a pool of workers, dispatching messages to
workers as they arrive. Or it can multiplex
a set of libpq connections in async mode, which I suspect may prove to be
better.
I've made provision for streaming support in the pglogical_output
extension. It'll need core changes to allow logical decoding to stream
changes though.
Separately, I'd also like to look at decoding and sending sequence
advances, which are something that happens outside transaction boundaries.
We have now in PostgreSQL some infrastructure for background workers, but
there is still no abstraction of workers pool and job queue which can
provide simple way to organize parallel execution of some jobs. I wonder if
somebody is working now on it or we should try to propose our solution?
I think a worker pool would be quite useful to have.
For BDR and for pglogical we had to build an infrastructure on top of
static and dynamic bgworkers. A static worker launches a dynamic bgworker
for each database. The dynamic bgworker for the database looks at
extension-provided user catalogs to determine whether it should launch more
dynamic bgworkers for each connection to a peer node.
Because the bgworker argument is a single by-value Datum the argument
passed is an index into a static shmem array of structs. The struct is
populated with the target database oid (or name, for 9.4, due to bgworker
API limitations) and other info needed to start the worker.
Because registered static and dynamic bgworkers get restarted by the
postmaster after a crash/restart cycle, and the restarted static worker
will register new dynamic workers after restart, we have to jump through
some annoying hoops to avoid duplicate bgworkers. A generation counter is
stored in postmaster memory and incremented on crash recovery then copied
to shmem. The high bits of the Datum argument to the workers embeds the
generation counter. They compare their argument's counter to the one in
shmem and exit if the counter differs, so the relaunched old generation of
workers exits after a crash/restart cycle. See the thread on
BGW_NO_RESTART_ON_CRASH for details.
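The packing described above might look roughly like the following sketch. The bit split and names are illustrative, not the actual BDR code:

```c
/* Illustrative packing of a crash/restart generation counter into the
 * high bits of the single by-value Datum passed to a bgworker; the low
 * bits index a static shmem array of worker-startup structs. */
#include <stdint.h>

#define GENERATION_BITS 16
#define SLOT_BITS       (64 - GENERATION_BITS)
#define SLOT_MASK       ((UINT64_C(1) << SLOT_BITS) - 1)

typedef uint64_t Datum;         /* stand-in for the PostgreSQL typedef */

static inline Datum
pack_worker_arg(uint16_t generation, uint64_t slot)
{
    return ((Datum) generation << SLOT_BITS) | (slot & SLOT_MASK);
}

static inline uint16_t
arg_generation(Datum d)
{
    return (uint16_t) (d >> SLOT_BITS);
}

static inline uint64_t
arg_slot(Datum d)
{
    return d & SLOT_MASK;
}

/* A relaunched worker compares its embedded generation against the
 * counter copied into shmem and exits if they differ. */
static inline int
worker_is_stale(Datum d, uint16_t shmem_generation)
{
    return arg_generation(d) != shmem_generation;
}
```

After a crash/restart cycle the postmaster-incremented generation in shmem no longer matches the one embedded in the relaunched workers' arguments, so they exit.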
In pglogical we're instead using BGW_NEVER_RESTART workers and doing
restarts ourselves when needed, ignoring the postmaster's ability to
restart bgworkers when the worker crashes.
It's likely that most projects using bgworkers for this sort of thing will
need similar functionality, so generalizing it into a worker pool API makes
a lot of sense. In the process we could really use an API to examine currently
registered and running bgworkers. Interested in collaborating on that?
Another thing I've wanted as part of this work is a way to get a one-time
authentication cookie from the server that can be passed as a libpq
connection option to get a connection without having to know a password or
otherwise mess with pg_hba.conf. Basically a way to say "I'm a bgworker
running with superuser rights within Pg, and I want to make a libpq
connection to this database. I'm inherently trusted, so don't mess with
pg_hba.conf and passwords, just let me in".
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3 December 2015 at 04:18, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
wrote:
The problem is that transactions are delivered to replica through single
channel: logical replication slot.
And while such transaction is waiting acknowledgement from arbiter, it is
blocking replication channel preventing other (parallel transactions) from
been replicated and applied.
Streaming interleaved xacts from the slot as discussed in the prior mail
would help there.
You'd continue to apply concurrent work from other xacts, and just handle
commit messages as they arrive, sending the confirmations back through the
DTM API.
I have implemented pool of background workers. May be it will be useful
not only for me.
Excellent.
It should be possible to make that a separate extension. You can use C
functions from other extensions by exposing a single pg_proc function with
'internal' return type that populates a struct of function pointers for the
API. A single DirectFunctionCall lets you get the API struct. That's how
pglogical_output handles hooks. The main downside is that you can't do that
without a connection to a database with the extension installed so the
pg_proc entry is exposed.
So it could make more sense to just keep it as a separate .c / .h file
that's copied into trees that use it. Simpler and easier, but uglier.
It consists of a single-producer multiple-consumers queue implemented using
a buffer in shared memory, spinlock and two semaphores.
[snip]
You just place in this queue some bulk of bytes (work, size), it is placed
in queue and then first available worker will dequeue it and execute.
Very nice.
To handle xact streaming I think you're likely to need a worker dispatch
key too, where the dispatch keys are "sticky" to a given worker. So you
assign xid 1231 to a worker at BEGIN. Send all work to the pool and
everything with xid 1231 goes to that worker. At commit you clear the
assignment of xid 1231.
Alternately a variant of the Execute method that lets you dispatch to a
specific worker would do the job.
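The sticky dispatch-key idea could be sketched like this: the first message for an xid binds it to a worker, later messages with the same xid route to the same worker, and commit/abort clears the binding. The fixed-size open-addressing table and all names here are hypothetical:

```c
/* Illustrative sticky xid-to-worker dispatcher.  Assumes fewer than
 * MAX_BINDINGS concurrently streamed transactions. */
#include <stdint.h>

#define MAX_BINDINGS 64
#define NO_WORKER    (-1)

typedef struct
{
    uint32_t xid;
    int      worker;            /* NO_WORKER if the slot is free */
} XidBinding;

typedef struct
{
    XidBinding bindings[MAX_BINDINGS];
    int        nworkers;
    int        next_rr;         /* round-robin cursor for new xids */
} Dispatcher;

static void
dispatcher_init(Dispatcher *d, int nworkers)
{
    for (int i = 0; i < MAX_BINDINGS; i++)
        d->bindings[i].worker = NO_WORKER;
    d->nworkers = nworkers;
    d->next_rr = 0;
}

/* Return the worker bound to xid, binding it round-robin at first sight
 * (i.e. at BEGIN). */
static int
dispatch_xid(Dispatcher *d, uint32_t xid)
{
    int slot = xid % MAX_BINDINGS;

    /* linear probing to find the xid's slot or a free one */
    while (d->bindings[slot].worker != NO_WORKER &&
           d->bindings[slot].xid != xid)
        slot = (slot + 1) % MAX_BINDINGS;

    if (d->bindings[slot].worker == NO_WORKER)
    {
        d->bindings[slot].xid = xid;
        d->bindings[slot].worker = d->next_rr++ % d->nworkers;
    }
    return d->bindings[slot].worker;
}

/* At commit/abort, clear the binding so the worker can be reused. */
static void
dispatch_done(Dispatcher *d, uint32_t xid)
{
    int slot = xid % MAX_BINDINGS;

    while (d->bindings[slot].worker != NO_WORKER)
    {
        if (d->bindings[slot].xid == xid)
        {
            d->bindings[slot].worker = NO_WORKER;
            return;
        }
        slot = (slot + 1) % MAX_BINDINGS;
    }
}
```

A variant of the pool's Execute method could take the result of dispatch_xid() to target a specific worker's queue.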
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3 December 2015 at 01:30, Robert Haas <robertmhaas@gmail.com> wrote:
1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
As Konstantin noted they aren't using Pg's 2PC. They actually couldn't if
they wanted to because logical decoding does not support decoding an xact
at PREPARE TRANSACTION time, without COMMIT PREPARED.
I'd love to change that and allow decoding at PREPARE TRANSACTION time - or
streaming the xact from the start, as discussed in the prior mail. This
would be a huge help for doing consensus operations on an otherwise
asynchronous cluster, like making table structure changes. You'd decode the
prepared xact, replay it, prepare it on all nodes, then commit prepared
when all nodes confirm successful prepare.
IIRC the main issue with this is that the prepared xact continues to hold
locks so logical decoding can't acquire the locks it needs to decode the
xact.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Dec 3, 2015, at 4:09 AM, Craig Ringer wrote:
On 1 December 2015 at 00:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
We have implemented ACID multimaster based on logical replication and our DTM (distributed transaction manager) plugin.
What are you using for an output plugin and for replay?
I have implemented an output plugin for multimaster based on Michael's decoder_raw + receiver_raw.
Right now it decodes WAL into corresponding SQL insert/update statements.
Certainly this is a very inefficient way, and in the future I will replace it with some binary protocol, as is used for example in BDR
(but the BDR plugin contains a lot of stuff related to detecting and handling conflicts which is not relevant for multimaster).
But right now the performance of multimaster is not limited by the logical replication protocol - if I remove the DTM and use asynchronous replication (a lightweight version of BDR:)
then I get 38k TPS instead of 12k.
I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstrem out too, with me helping where I can.
I'd hate to be wasting time and effort working in parallel on overlapping functionality. I did a LOT of work to make pglogical_output extensible and reusable for different needs, with hooks used heavily instead of making things specific to the pglogical downstream. A protocol documented in detail. A json output mode as an option. Parameters for clients to negotiate options. etc.
Would a different name for the upstream output plugin help?
And where can I get the pglogical_output plugin? Sorry, but I can't quickly find a reference with Google...
Also I wonder if this plugin performs DDL replication (most likely not). But then a naive question - why was DDL excluded from the logical replication protocol?
Are there some principal problems with it? In BDR it was handled in an alternative way, using an executor callback. It would be much easier if DDL could be replicated in the same way as normal SQL statements.
On Dec 3, 2015, at 4:18 AM, Craig Ringer wrote:
Excellent.
It should be possible to make that a separate extension. You can use C functions from other extensions by exposing a single pg_proc function with 'internal' return type that populates a struct of function pointers for the API. A single DirectFunctionCall lets you get the API struct. That's how pglogical_output handles hooks. The main downside is that you can't do that without a connection to a database with the extension installed so the pg_proc entry is exposed.
Actually, while working on the cluster and the columnar storage extension, I have got several questions about PostgreSQL infrastructure.
I have always found some workarounds, but maybe it is better to ask the community about them :)
1. Why is there no "conditional event" synchronization primitive in PostgreSQL? There is the latch, but it is implemented using sockets and I am afraid it is not very fast.
It would be nice to have some fast primitive like pthread condition variables.
2. PostgreSQL semaphores seem not to be intended for use outside the PostgreSQL core (for example in extensions).
There is no way to request an additional number of semaphores. Right now semaphores are allocated based on the maximal number of backends and spinlocks.
And a semaphore, as well as an event, is a very popular and convenient synchronization primitive required in many cases.
3. What is the right way to create a background worker requiring access to shared memory, i.e. having its control structure in shared memory?
As far as I understand, background workers have to be registered either in _PG_init or outside the postmaster environment.
If an extension requires access to shared memory, then it should be included in the shared_preload_libraries list and should be initialized using the shmem_startup hook.
Something like this:
void _PG_init(void)
{
    if (!process_shared_preload_libraries_in_progress)
        return;
    ...
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = My_shmem_startup;
}
My_shmem_startup is needed because it is not possible to allocate shared memory in _PG_init.
So if I need to allocate some control structure for the background workers in shared memory, then I should do it in My_shmem_startup.
But I can not register background workers in My_shmem_startup! I will get a "must be registered in shared_preload_libraries" error:
void
RegisterBackgroundWorker(BackgroundWorker *worker)
{
    if (!process_shared_preload_libraries_in_progress)
    {
        if (!IsUnderPostmaster)
            ereport(LOG,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("background worker \"%s\": must be registered in shared_preload_libraries",
                            worker->bgw_name)));
        return;
    }
}
So I have to register background workers in _PG_init while the control structure for them is not yet ready.
When I implemented the pool of background workers, I solved this problem by providing a function which returns the address of the control structure later - when it is actually allocated.
But isn't this some design flaw in the BGW API?
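The deferred-getter workaround mentioned above might be sketched like this; the names and the statically simulated shared memory are purely illustrative, not PostgreSQL APIs:

```c
/* Illustrative workaround: the worker is registered in _PG_init, but it
 * locates its shared control structure through a getter that only becomes
 * valid after the shmem_startup hook has run. */
#include <stddef.h>

typedef struct
{
    int nWorkers;
} PoolControl;

/* Filled in later, by the shmem_startup hook. */
static PoolControl *pool_control = NULL;

/* The indirection handed to the worker at registration time: the control
 * structure is resolved lazily, when the worker actually starts. */
static PoolControl *
GetPoolControl(void)
{
    return pool_control;
}

/* Stand-in for My_shmem_startup: in a real extension this would allocate
 * the structure in shared memory; here it is simulated statically. */
static PoolControl simulated_shmem;

static void
MyShmemStartup(void)
{
    pool_control = &simulated_shmem;
    pool_control->nWorkers = 8;
}
```

The worker's main function calls GetPoolControl() at startup, by which time the postmaster has run the shmem_startup hooks.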
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru>
wrote:
I'd really like to collaborate using pglogical_output if at all possible.
Petr's working really hard to get the pglogical downstream out too, with me
helping where I can.
And where I can get pglogical_output plugin? Sorry, but I can't quickly
find reference with Google...
It's been submitted to this CF.
https://commitfest.postgresql.org/7/418/
https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
Any tests and comments would be greatly appreciated.
I have a version compatible with 9.4 and older in a separate tree I want to
make public. I'll get back to you on that later today. It's the same code
with a few more ifdefs and an uglier structure for the example hooks module
(because it can be a separate contrib)¸so it's not that exciting.
You should be able to just "git remote add" that repo, "git fetch" and "git
merge dev/pglogical-output" into your working tree.
Also I wonder if this plugin perform DDL replication (most likely not).
No, it doesn't. The way it's done in BDR is too intrusive and has to be
reworked before it can be made more generally re-usable.
How I envision DDL replication working for pglogical (or anything else) is
to take the DDL hooks added in 9.5 and use them with a separate DDL deparse
extension based on Álvaro's deparse work. If you want to replicate DDL you
make sure this extension is loaded then use it from your event triggers to
capture DDL in a useful form and write it to a queue table where your
downstream client can find it and consume it. That way the deparse code
doesn't have to be embedded in the Pg backend like it is in BDR, and
instead can be a reusable extension.
But then naive question - why DDL was excluded from logical replication
protocol?
logical decoding can't handle DDL because all it sees is the effects of
that DDL in the xlog as a series of changes to catalog tables, relfilenode
changes, etc. It can't turn that back into the original DDL in any kind of
reliable way. A downstream can't do very much with "rename relfilenode 1231
to 1241".
There are a few cases we might want to handle through decoding - in
particular I'd like to be able to decode changes to rows in shared catalogs
like pg_authid, since we can't handle that with DDL deparse. For things
like DROP TABLE, CREATE TABLE, etc we really need DDL hooks. At least as I
currently understand things.
So we try to capture DDL at a higher level. That's why event triggers were
added (http://www.postgresql.org/docs/current/static/event-triggers.html)
and why DDL deparse was implemented (
https://commitfest-old.postgresql.org/action/patch_view?id=1610).
You can't just capture the raw DDL statement since there are issues with
search_path normalization, etc. Similar problems to statement based
replication exist. Deparse is required to get the DDL after it's converted
to a utility statement so we can obtain it in an unambiguous form.
I'll add some explanation in pglogical_output's DESIGN.md for why DDL is
not currently handled.
TRUNCATE _is_ handled, by the way. In pglogical we use regular TRUNCATE
triggers (marked tgisinternal) for that. There are some significant
complexities around foreign keys, sequence reset, etc, which are not fully
handled yet.
Are there some principle problems with it? In BDR it was handled in
alternative way, using executor callback. It will be much easier if DDL can
be replicated in the same way as normal SQL statements.
It can't. I wish it could.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 3, 2015 at 8:34 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
Are there some fundamental problems with it? In BDR it was handled in an
alternative way, using an executor callback. It will be much easier if DDL can
be replicated in the same way as normal SQL statements.
It can't. I wish it could.
That reminds me of that DDL deparsing patch I was trying to revive a while
ago. Strangely, I cannot find it in any of the commit fests. Will add it.
--
Alex
On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstream out too, with me helping where I can.
And where can I get the pglogical_output plugin? Sorry, but I can't quickly find a reference with Google...
It's been submitted to this CF.
https://commitfest.postgresql.org/7/418/
https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
Any tests and comments would be greatly appreciated.
Thank you.
I wonder if there is an opposite part of the pipe for pglogical_output - an analog of receiver_raw?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3 December 2015 at 12:06, konstantin knizhnik <k.knizhnik@postgrespro.ru>
wrote:
On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
I'd really like to collaborate using pglogical_output if at all possible.
Petr's working really hard to get the pglogical downstream out too, with me
helping where I can.
And where can I get the pglogical_output plugin? Sorry, but I can't quickly
find a reference with Google...
It's been submitted to this CF.
https://commitfest.postgresql.org/7/418/
https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
Any tests and comments would be greatly appreciated.
Thank you.
I wonder if there is opposite part of the pipe for pglogical_output -
analog of receiver_raw?
Yes, there is. pglogical is currently in test and will be available
sometime soon.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru
wrote:
But it looks like there is not much sense in having multiple network
connections between one pair of nodes.
It seems better to have one connection between nodes, but provide
parallel execution of received transactions at the destination side. But that
also seems nontrivial. We now have some infrastructure in PostgreSQL
for background workers, but there is still no abstraction of a worker pool and
job queue which would provide a simple way to organize parallel execution of
some jobs. I wonder if somebody is working on it now, or should we try to
propose our solution?
There are definitely two clear places where additional help would be useful
and welcome right now.
1. Allowing logical decoding to have a "speculative pre-commit data"
option, to allow some data to be made available via the decoding api,
allowing data to be transferred prior to commit. This would allow us to
reduce the delay that occurs at commit, especially for larger transactions
or very low latency requirements for smaller transactions. Some heuristic
or user interface would be required to decide whether to and which
transactions might make their data available prior to commit. And we would
need to send abort messages should the transactions not commit as expected.
That would be a patch on logical decoding and is an essentially separate
feature to anything currently being developed.
2. Some mechanism/theory to decide when/if to allow parallel apply. That
could be used for both physical and logical replication. Since the apply
side of logical replication is still being worked on there is a code
dependency there, so a working solution isn't what is needed yet. But the
general principles and any changes to the data content (wal_level) or
protocol (pglogical_output) would be useful.
We already have working multi-master that has been contributed to PGDG, so
contributing that won't gain us anything. There is a lot of code and
pglogical is the most useful piece of code to be carved off and reworked
for submission. The bottleneck is review and commit, not initial
development - which applies both to this area and most others in PostgreSQL.
Having a single network connection between nodes would increase efficiency
but also increase replication latency, so it's not useful in all cases.
I think having some kind of message queue between nodes would also help,
since there are many cases for which we want to transfer data, not just a
replication data flow. For example, consensus on DDL, or MPP query traffic.
But that is open to wider debate.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3 December 2015 at 15:27, konstantin knizhnik <k.knizhnik@postgrespro.ru>
wrote:
On Dec 3, 2015, at 4:18 AM, Craig Ringer wrote:
Excellent.
It should be possible to make that a separate extension. You can use C
functions from other extensions by exposing a single pg_proc function with
'internal' return type that populates a struct of function pointers for the
API. A single DirectFunctionCall lets you get the API struct. That's how
pglogical_output handles hooks. The main downside is that you can't do that
without a connection to a database with the extension installed so the
pg_proc entry is exposed.
Actually, while working on the cluster and the columnar storage extension I came up with
several questions about PostgreSQL infrastructure.
I always found some workarounds, but maybe it is better to ask the community
about them :)
1. Why is there no "conditional event" synchronization primitive in
PostgreSQL? There is the latch, but it is implemented using sockets and I am
afraid that it is not very fast.
It would be nice to have some fast primitive like pthread condition
variables.
The need for IPC makes things a bit more complex. Most places can get away
with using a latch, testing one or more conditions, and resuming waiting.
While what you describe sounds possibly nice is there any evidence that
it's a bottleneck or performance issue? Or is this premature optimisation
at work?
2. PostgreSQL semaphores seem not to be intended for use outside the
PostgreSQL core (for example, in extensions).
There is no way to request an additional number of semaphores. Right now
semaphores are allocated based on the maximal number of backends and spinlocks.
Same with spinlocks AFAIK.
You can add your own LWLocks though.
3. What is the right way to create a background worker that requires access
to shared memory, i.e. has a control structure in shared memory?
This is documented and well established.
As far as I understand, background workers have to be registered either in
_PG_init or outside the postmaster environment.
If an extension requires access to shared memory, then it should be
listed in shared_preload_libraries and should be initialized using the
shmem_startup hook.
Correct.
You can use dynamic shmem instead, but there are some issues there IIRC.
Petr may have more to say there.
Take a look at the BDR code for some examples, and there are some in
contrib too I think.
My_shmem_startup is needed because in _PG_init it is not possible to
allocate shared memory.
Correct, since it's in early postmaster start.
So if I need to allocate some control structure for background workers in
shared memory, then I should do it in My_shmem_startup.
Yes.
But I can not register background workers in My_shmem_startup!
Correct. Register static bgworkers in _PG_init. Register dynamic bgworkers
later, in a normal backend function or a bgworker main loop.
So I have to register background workers in _PG_init while the control
structure for them is not yet ready.
Correct.
They aren't *started* until after shmem init, though.
When I implemented a pool of background workers, I solved this problem
by providing a function which returns the address of the control structure
later - when it is actually allocated.
Beware of EXEC_BACKEND. You can't assume you have shared postmaster memory
from fork().
I suggest that you allocate a static shmem array. Pass indexes into it as
the arguments to the bgworkers. Have them look up their index in the array
to get their struct pointer.
Read the BDR code to see how this can work; see bdr_perdb.c, bdr_apply.c,
etc's bgworker main loops, bdr_supervisor.c and bdr_perdb.c's code for
registering dynamic bgworkers, and the _PG_init function's setup of the
static supervisor bgworker.
In your case I think you should probably be using dynamic bgworkers for
your pool anyway, so you can grow and shrink them as-needed.
But it seems to be a design flaw in BGW, isn't it?
I don't think so. You're registering the worker, saying "when you're ready
please start this". You're not starting it.
You can use dynamic bgworkers too. Same deal, you register them and the
postmaster starts them in a little while, but you can register them after
_PG_init.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3 December 2015 at 19:06, konstantin knizhnik <k.knizhnik@postgrespro.ru>
wrote:
On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
I'd really like to collaborate using pglogical_output if at all possible.
Petr's working really hard to get the pglogical downstream out too, with me
helping where I can.
And where can I get the pglogical_output plugin? Sorry, but I can't quickly
find a reference with Google...
It's been submitted to this CF.
https://commitfest.postgresql.org/7/418/
https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
Any tests and comments would be greatly appreciated.
Thank you.
I wonder if there is opposite part of the pipe for pglogical_output -
analog of receiver_raw?
It's pglogical, and it's in progress, due to be released at the same time
as 9.5. We're holding it a little longer to nail down the user interface a
bit better, etc, and because sometimes the real world gets in the way.
The catalogs and UI are very different to BDR, it's much more
extensible/modular, it supports much more flexible topologies, etc... but
lots of the core concepts are very similar. So if you go take a look at the
BDR code that'll give you a pretty solid idea of how a lot of it works,
though BDR has whole subsystems pglogical doesn't (global ddl lock, ddl
replication, etc).
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3 December 2015 at 20:39, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
But looks like there is not so much sense in having multiple network
connection between one pair of nodes.
It seems to be better to have one connection between nodes, but provide
parallel execution of received transactions at destination side. But it
seems to be also nontrivial. We have now in PostgreSQL some infrastructure
for background works, but there is still no abstraction of workers pool and
job queue which can provide simple way to organize parallel execution of
some jobs. I wonder if somebody is working now on it or we should try to
propose our solution?
There are definitely two clear places where additional help would be
useful and welcome right now.
Three IMO, in that a re-usable, generic bgworker pool driven by shmem
messaging would be quite handy. We'll want something like that when we have
transaction interleaving.
I think Konstantin's design is a bit restrictive at the moment; at the
least it needs to address sticky dispatch, and it almost certainly needs to
be using dynamic bgworkers (and maybe dynamic shmem too) to be flexible.
Some thought will be needed to make sure it doesn't rely on !EXEC_BACKEND
stuff like passing pointers to fork()ed data from postmaster memory too.
But the general idea sounds really useful, and we'll either need that or to
use async libpq for concurrent apply.
1. Allowing logical decoding to have a "speculative pre-commit data"
option, to allow some data to be made available via the decoding api,
allowing data to be transferred prior to commit.
Petr, Andres and I tended to refer to that as interleaved transaction
streaming. The idea being to send changes from multiple xacts mixed
together in the stream, identified by an xid sent with each message, as we
decode them from WAL. Currently we add them to a local reorder buffer and
send them only in commit order after commit.
This moves responsibility for xact ordering (and buffering, if necessary)
to the downstream. It introduces the possibility that concurrently replayed
xacts could deadlock with each other and a few exciting things like that,
too, but with the payoff that we can continue to apply small transactions
in a timely manner even as we're streaming a big transaction like a COPY.
We could possibly enable interleaving right from the start of the xact, or
only once it crosses a certain size threshold. For your purposes Konstantin
you'd want to do it right from the start since latency is crucial for you.
For pglogical we'd probably want to buffer them a bit and only start
streaming if they got big.
This would allow us to reduce the delay that occurs at commit, especially
for larger transactions or very low latency requirements for smaller
transactions. Some heuristic or user interface would be required to decide
whether to and which transactions might make their data available prior to
commit.
I imagine we'd have a knob, either global or per-slot, that sets a
threshold based on size in bytes of the buffered xact. With 0 allowed as
"start immediately".
And we would need to send abort messages should the transactions not
commit as expected. That would be a patch on logical decoding and is an
essentially separate feature to anything currently being developed.
I agree that this is strongly desirable. It'd benefit anyone using logical
decoding and would have wide applications.
2. Some mechanism/theory to decide when/if to allow parallel apply.
I'm not sure it's as much about allowing it as how to do it.
We already have working multi-master that has been contributed to PGDG, so
contributing that won't gain us anything.
Namely BDR.
There is a lot of code and pglogical is the most useful piece of code to
be carved off and reworked for submission.
Starting with the already-published output plugin, with the downstream to
come around the release of 9.5.
Having a single network connection between nodes would increase efficiency
but also increase replication latency, so it's not useful in all cases.
If we interleave messages I'm not sure it's too big a problem. Latency
would only become an issue there if a big single row (big Datum contents)
causes lots of small work to get stuck behind it.
IMO this is a separate issue to be dealt with later.
I think having some kind of message queue between nodes would also help,
since there are many cases for which we want to transfer data, not just a
replication data flow. For example, consensus on DDL, or MPP query traffic.
But that is open to wider debate.
Logical decoding doesn't really define any network protocol at all. It's
very flexible, and we can throw almost whatever we want down it. The
pglogical_output protocol is extensible enough that we can just add
additional messages when we need to, making them opt-in so we don't break
clients that don't understand them.
I'm likely to need to do that soon for sequence-advance messages if I can
get logical decoding of sequence advance working.
We might want a way to queue those messages at a particular LSN, so we can
use them for replay barriers etc and ensure they're crash-safe. Like the
generic WAL messages used in BDR and proposed for core. Is that what you're
getting at? WAL messages would certainly be nice, but I think we can mostly
if not entirely avoid the need for them if we have transaction interleaving
and concurrent transaction support.
Somewhat related, I'd quite like to be able to send messages from
downstream back to upstream, where they're passed to a hook on the logical
decoding plugin. That'd eliminate the need to do a whole bunch of stuff
that currently has to be done using direct libpq connections or a second
decoding slot in the other direction. Basically send a CopyData packet in
the other direction and have its payload passed to a new hook on output
plugins.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-12-03 14:32, Craig Ringer wrote:
On 3 December 2015 at 15:27, konstantin knizhnik
<k.knizhnik@postgrespro.ru> wrote:
3. What is the right way to create a background worker that requires access to shared memory, i.e. has a control structure in shared memory?
This is documented and well established.
As far as I understand, background workers have to be registered
either in _PG_init or outside the postmaster environment.
If an extension requires access to shared memory, then it should be
listed in shared_preload_libraries and should be
initialized using the shmem_startup hook.
Correct.
You can use dynamic shmem instead, but there are some issues there IIRC.
Petr may have more to say there.
Take a look at the BDR code for some examples, and there are some in
contrib too I think.
If you have your own flock of dynamic workers that you manage yourself,
it's probably easier to use dynamic shared memory. You can see some
examples in the tests and also in the parallel query code for how to do
it. The only real issue we faced with using dynamic shared memory was
that we needed to do IPC from normal backends and that gets complicated
when you don't have the worker info in the normal shmem.
The registration timing and working with normal shmem is actually not a
problem. Just register shmem start hook in _PG_init and if you are
registering any bgworkers there as well make sure you set bgw_start_time
correctly (usually what you want is BgWorkerStart_RecoveryFinished).
Then you'll have the shmem hook called before the bgworker is actually
started.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3 December 2015 at 20:39, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 November 2015 at 17:20, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
But looks like there is not so much sense in having multiple network
connection between one pair of nodes.
It seems to be better to have one connection between nodes, but provide
parallel execution of received transactions at destination side. But it
seems to be also nontrivial. We have now in PostgreSQL some infrastructure
for background works, but there is still no abstraction of workers pool and
job queue which can provide simple way to organize parallel execution of
some jobs. I wonder if somebody is working now on it or we should try to
propose our solution?
There are definitely two clear places where additional help would be
useful and welcome right now.
1. Allowing logical decoding to have a "speculative pre-commit data"
option, to allow some data to be made available via the decoding api,
allowing data to be transferred prior to commit.
Something relevant I ran into re this:
in reorderbuffer.c, on ReorderBufferCommit:
* We currently can only decode a transaction's contents in when their commit
* record is read because that's currently the only place where we know about
* cache invalidations. Thus, once a toplevel commit is read, we iterate over
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
I haven't dug into the implications particularly as I'm chasing something
else, but want to note it on the thread. Here be dragons when it comes to
transaction streaming.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
I have integrated pglogical_output in multimaster, using bdr_apply from BDR as a template for the implementation of the receiver part.
The insert time is reduced almost 10 times compared with logical replication based on the decoder_raw/receiver_raw plugins, which perform logical replication using SQL statements. But unfortunately the time of updates is almost unchanged.
This is an expected result, because I didn't see any functions related to SQL parsing/preparing in the profile.
Now in both cases profile is similar:
4.62% postgres [.] HeapTupleSatisfiesMVCC
2.99% postgres [.] heapgetpage
2.10% postgres [.] hash_search_with_hash_value
1.86% postgres [.] ExecProject
1.80% postgres [.] heap_getnext
1.79% postgres [.] PgXidInMVCCSnapshot
By the way, you asked about comments concerning pglogical_output. I have one: most of the pglogical protocol functions have a "PGLogicalOutputData *data" parameter. There are a few exceptions:
write_startup_message_fn, pglogical_write_origin_fn, pglogical_write_rel_fn
PGLogicalOutputData is the only way to pass protocol-specific data, using the "PGLogicalProtoAPI *api" field.
This field is assigned by the pglogical_init_api() function. And I can extend this PGLogicalProtoAPI structure by adding some protocol-specific fields.
For example, this is how it is done now for multimaster:
typedef struct PGLogicalProtoMM
{
    PGLogicalProtoAPI api;
    bool              isLocal;  /* mark transaction as local */
} PGLogicalProtoMM;

PGLogicalProtoAPI *
pglogical_init_api(PGLogicalProtoType typ)
{
    PGLogicalProtoMM  *pmm = palloc0(sizeof(PGLogicalProtoMM));
    PGLogicalProtoAPI *res = &pmm->api;

    pmm->isLocal = false;
    res->write_rel = pglogical_write_rel;
    res->write_begin = pglogical_write_begin;
    res->write_commit = pglogical_write_commit;
    res->write_insert = pglogical_write_insert;
    res->write_update = pglogical_write_update;
    res->write_delete = pglogical_write_delete;
    res->write_startup_message = write_startup_message;
    return res;
}
But I have to add a "PGLogicalOutputData *data" parameter to the pglogical_write_rel_fn function.
Do you think it would be better to pass this parameter to all functions?
Maybe it is not the intended way of passing custom data to these functions...
Certainly it is possible to use static variables for this purpose.
But I think that passing user-specific data through PGLogicalOutputData is a safer and more flexible solution.
On 12/03/2015 04:53 PM, Craig Ringer wrote:
On 3 December 2015 at 19:06, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:
On 3 December 2015 at 14:54, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
I'd really like to collaborate using pglogical_output if at all possible. Petr's working really hard to get the pglogical downstream out too, with me helping where I can.
And where I can get pglogical_output plugin? Sorry, but I can't quickly find reference with Google...
It's been submitted to this CF.
https://commitfest.postgresql.org/7/418/
https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
Any tests and comments would be greatly appreciated.
Thank you.
I wonder if there is opposite part of the pipe for pglogical_output - analog of receiver_raw?
It's pglogical, and it's in progress, due to be released at the same time as 9.5. We're holding it a little longer to nail down the user interface a bit better, etc, and because sometimes the real world gets in the way.
The catalogs and UI are very different to BDR, it's much more extensible/modular, it supports much more flexible topologies, etc... but lots of the core concepts are very similar. So if you go take a look at the BDR code that'll give you a pretty solid
idea of how a lot of it works, though BDR has whole subsystems pglogical doesn't (global ddl lock, ddl replication, etc).
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 7 December 2015 at 01:39, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
wrote:
I have integrated pglogical_output in multimaster
Excellent.
I just pushed a change to pglogical_output that exposes the row contents
(and the rest of the reorder change buffer contents) to hooks that want it,
by the way.
using bdr_apply from BDR as template for implementation of receiver part.
Yep, that'll tide you over. We're working hard on getting the downstream
part ready and you'll find it more flexible.
The time of insert is reduced almost 10 times comparing with logical
replication based on decoder_raw/receiver_raw plugins which performs
logical replication using SQL statements. But unfortunately time of updates
is almost not changed.
That's not too surprising, given that you'll have significant overheads for
checking if keys are present when doing updates.
This field is assigned by pglogical_init_api() function. And I can extend
this PGLogicalProtoAPI structure by adding some protocol specific fields.
Yep, that's the idea.
typedef struct PGLogicalProtoMM
{
PGLogicalProtoAPI api;
bool isLocal; /* mark transaction as local */
} PGLogicalProtoMM;
I'm curious about what you're using the 'isLocal' field for.
For MM you should only need to examine the replication origin assigned to
the transaction to determine whether you're going to forward it or not.
Were you not able to achieve what you wanted with a hook? If not, then we
might need another hook. Could you explain what it's for in more detail?
What I suggest is: have your downstream client install a pglogical_output
hook for the transaction filter hook. There, examine the replication origin
passed to the hook. If you want to forward locally originated xacts only
(such as for mesh multimaster) you can just filter out everything where the
origin is not InvalidRepOriginId. There are example hooks in
contrib/pglogical_output_plhooks .
There'll be a simple MM example using filter hooks in the pglogical
downstream btw and we're working hard to get that out.
But I have to add "PGLogicalOutputData *data" parameter to
pglogical_write_rel_fn function.
Do you think that it will be better to pass this parameter to all
functions?
Yes, I agree that it should be passed to the API for the output protocol.
It's pretty harmless. Please feel free to send a pull req.
Note that we haven't made that pluggable from the outside though; there's
no way to load a new protocol distributed separately from pglogical_output.
The idea is really to make sure that between the binary protocol and json
protocol we meet the reasonably expected set of use cases and don't need
pluggable protocols. Perhaps that's over-optimistic, but we've already got
an output plugin that has plug-in hooks, a plugin for a plugin. Do we need
another? Also, if we allow dynamic loading of new protocols then that means
we'll have a much harder time changing the protocol implementation API
later, so it's not something I'm keen to do. Also, to make it secure to
allow users to specify the protocol we'd have to make protocols implement
an extension with a C function in pg_proc to return its API struct, like we
do for hooks. So there'd be more hoop-jumping required to figure out how to
talk to the client.
If possible I'd like to find any oversights and omissions in the current
protocol and its APIs to meet future use cases without having to introduce
protocol plugins for an output plugin.
Maybe it is not the intended way of passing custom data to these functions...
Yeah, we weren't really thinking of the protocol API as intended to be
pluggable and extensible. If you define new protocols you have to change
the rest of the output plugin code anyway.
Lets look at what protocol changes are needed to address your use case and
see whether it's necessary to take the step of making the protocol fully
pluggable or not.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services