Replication identifiers, take 3
Hi,
I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
.
The've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de
As the topic has garnered some heat and confusion I thought it'd be
worthwile to start afresh with an explanation why I think they're
useful.
I don't really want to discuss about implementation specifics for now,
but rather about (details of the) concept. Once we've hashed those out,
I'll adapt the existing patch to match them.
There are three primary use cases for replication identifiers:
1) The ability Monitor how for replication has progressed in a
crashsafe manner to allow it to restart at the right point after
errors/crashes.
2) Efficiently identify the origin of individual changes and
transactions. In multimaster and some cascading scenarios it is
necessary to do so to avoid sending out replayed changes again.
3) Allow to efficiently filter out replayed changes from logical
decoding. It's currently possible to filter changes from inside the
output plugin, but it's much more efficient to filter them out
before decoding.
== Logical Decoding Background ==
To understand the need for 1) it's important to roughly understand how
logical decoding/walsender streams changes and handles feedback from
the receiving side. A walsender performing logical decoding
*continously* sends out transactions. As long as there's new local
changes (i.e. unprocessed WAL) and the network buffers aren't full it
will send changes. *Without* waiting for the client. Everything else
would lead to horrible latency.
Because it sends data without waiting for the client to have processed
them it obviously can't remove resources that are needed to stream
them out again. The client or network connection could crash after
all.
To let the sender know when it can remove resources the receiver
regularly sends back 'feedback' messages acknowledging up to where
changes have been safely received. Whenever such a feedback message
arrives the sender can release resources that aren't needed to decode
the changes below that horizon.
When the receiver ask the server to stream changes out it tells the
sender at which LSN it should start sending changes. All
*transactions* that *commit* after that LSN are sent out. Possibly
again.
== Crashsafe apply ==
Based on those explanations, when building a logical replication
solution on top of logical decoding, one must remember the latest
*remote* LSN that already has been replayed. So that, when the apply
process or the whole database crashes, it is possibly to ask for all
changes since the last transaction that has been successfully applied.
The trivial solution here is to have a table (remote_node,
last_replayed_lsn) and update it for every replayed
transaction. Unfortunately that doesn't perform very well because that
table quickly gets heavily bloated. It's also hard to avoid page level
contention when replaying transaction from multiple remote
nodes. Additionally these changes have to be filtered out when
replicating these changes in a cascading fashion.
To do this more efficiently there needs to be a crashsafe way to
associate the latest successfully replayed remote transaction.
== Identify the origin of changes ==
Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.
To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.
If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.
It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.
== Proposed solution ==
These two fundamental problems happen to have overlapping
requirements.
A rather efficient solution for 1) is to attach the 'origin node' and
the remote commit LSN to every local commit record produced by
replay. That allows to have a shared memory "table" (remote_node,
local_lsn, remote_lsn).
During replay that table is kept up2date in sync with transaction
commits. If updated within the transaction commit's critical section
it's guaranteed to be correct, even if transactions can abort due to
constraint violations and such. When the cluster crashes it can be
rebuilt during crash recovery, by updating values whenever a commit
record is read.
The primary complexity here is that the identification of the
'origin' node should be as small as possible to keep the WAL volume
down.
Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.
Again the primary problem here is how to efficiently identify origin
nodes.
== Replication Identifiers ==
The above explains the need to have as small as possible identifiers
for nodes. Two years back I'd suggested that we rely on the user to
manually assign 16bit ids to individual nodes. Not very surprisingly
that was shot down because a) 16bit numbers are not descriptive b) a
per node identifier is problematic because it prohibits replication
inside the same cluster.
So, what I've proposed since is to have two different forms of
identifiers. A long one, that's as descriptive as
$replication_solution wants. And a small one (16bit in my case) that's
*only meaningful within one node*. The long, user facing, identifier
is locally mapped to the short one.
In the first version I proposed these long identifiers had a fixed
form, including the system identifier, timeline id, database id, and a
freeform name. That wasn't well received and I agree that that's too
narrow. I think it should be a freeform text of arbitrary length.
Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.
What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.
There's a example of how this can be used from SQL at
http://archives.postgresql.org/message-id/20131114172632.GE7522@alap2.anarazel.de
That version is built on top of commit timestamps, but that only shows
because pg_replication_identifier_setup_tx_origin() allows to set the
source transaction's timestamp.
With that, far too long, explanation, is it clearer what I think
replication identifiers are for? What's your thougts?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for this write-up.
On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
1) The ability Monitor how for replication has progressed in a
crashsafe manner to allow it to restart at the right point after
errors/crashes.
2) Efficiently identify the origin of individual changes and
transactions. In multimaster and some cascading scenarios it is
necessary to do so to avoid sending out replayed changes again.
3) Allow to efficiently filter out replayed changes from logical
decoding. It's currently possible to filter changes from inside the
output plugin, but it's much more efficient to filter them out
before decoding.
I agree with these goals.
Let me try to summarize the information requirements for each of these
things. For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message. For #2, you need to know, when decoding each
change, what the origin node was. And for #3, you need to know, when
decoding each change, whether it is of local origin. The information
requirements for #3 are a subset of those for #2.
A rather efficient solution for 1) is to attach the 'origin node' and
the remote commit LSN to every local commit record produced by
replay. That allows to have a shared memory "table" (remote_node,
local_lsn, remote_lsn).
This seems OK to me, modulo some arguing about what the origin node
information ought to look like. People who are not using logical
replication can use the compact form of the commit record in most
cases, and people who are using logical replication can pay for it.
Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.
I think this is a lot more problematic. I agree that having the data
in the commit record isn't sufficient here, because for filtering
purposes (for example) you really want to identify the problematic
transactions at the beginning, so you can chuck their WAL, rather than
reassembling the transaction first and then throwing it out. But
putting the origin ID in every insert/update/delete is pretty
unappealing from my point of view - not just because it adds bytes to
WAL, though that's a non-trivial concern, but also because it adds
complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
happier with a solution where, say, we have a separate WAL record that
says "XID 1234 will be decoding for origin 567 until further notice".
== Replication Identifiers ==
The above explains the need to have as small as possible identifiers
for nodes. Two years back I'd suggested that we rely on the user to
manually assign 16bit ids to individual nodes. Not very surprisingly
that was shot down because a) 16bit numbers are not descriptive b) a
per node identifier is problematic because it prohibits replication
inside the same cluster.So, what I've proposed since is to have two different forms of
identifiers. A long one, that's as descriptive as
$replication_solution wants. And a small one (16bit in my case) that's
*only meaningful within one node*. The long, user facing, identifier
is locally mapped to the short one.In the first version I proposed these long identifiers had a fixed
form, including the system identifier, timeline id, database id, and a
freeform name. That wasn't well received and I agree that that's too
narrow. I think it should be a freeform text of arbitrary length.Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.
I'm pretty fuzzy on how this actually works. Like, the short form
here is just getting injected into WAL by the apply process. How does
it figure out what value to inject? What if it injects a value that
doesn't have a short-to-long mapping? What's the point of the
short-to-long mappings in the first place? Is that only required
because of the possibility that there might be multiple replication
solutions in play on the same node?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-25 22:44:49 -0400, Robert Haas wrote:
Thanks for this write-up.
On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
1) The ability Monitor how for replication has progressed in a
crashsafe manner to allow it to restart at the right point after
errors/crashes.
2) Efficiently identify the origin of individual changes and
transactions. In multimaster and some cascading scenarios it is
necessary to do so to avoid sending out replayed changes again.
3) Allow to efficiently filter out replayed changes from logical
decoding. It's currently possible to filter changes from inside the
output plugin, but it's much more efficient to filter them out
before decoding.I agree with these goals.
Let me try to summarize the information requirements for each of these
things. For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message.
I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.
Say you replay the following:
SET synchronous_commit = off;
BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/10 */;
BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/20 */;
BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/30 */;
If postgres crashes at any point during this, we need to know whether we
successfully replayed up to 0/10, 0/20 or 0/30. Note that the problem
exists independent of s_c=off, it just excerbates the issue.
Then, after finishing recovery and discovering only 0/10 has persisted,
the standby can reconnect to the primary and do
START_REPLICATION SLOT .. LOGICAL 0/10;
and it'll receive all transactions that have committed since 0/10.
For #2, you need to know, when decoding each
change, what the origin node was. And for #3, you need to know, when
decoding each change, whether it is of local origin. The information
requirements for #3 are a subset of those for #2.
Right. For #3 it's more important to have the information available
efficiently on individual records.
A rather efficient solution for 1) is to attach the 'origin node' and
the remote commit LSN to every local commit record produced by
replay. That allows to have a shared memory "table" (remote_node,
local_lsn, remote_lsn).This seems OK to me, modulo some arguing about what the origin node
information ought to look like. People who are not using logical
replication can use the compact form of the commit record in most
cases, and people who are using logical replication can pay for it.
Exactly.
Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.I think this is a lot more problematic. I agree that having the data
in the commit record isn't sufficient here, because for filtering
purposes (for example) you really want to identify the problematic
transactions at the beginning, so you can chuck their WAL, rather than
reassembling the transaction first and then throwing it out. But
putting the origin ID in every insert/update/delete is pretty
unappealing from my point of view - not just because it adds bytes to
WAL, though that's a non-trivial concern, but also because it adds
complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
happier with a solution where, say, we have a separate WAL record that
says "XID 1234 will be decoding for origin 567 until further notice".
I think it actually ends up much simpler than what you propose. In the
apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
or it's C equivalent. That sets a global variable which XLogInsert()
includes the record.
Note that this doesn't actually require any additional space in the WAL
- padding bytes in struct XLogRecord are used to store the
identifier. These have been unused at least since 8.0.
I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.
== Replication Identifiers ==
The above explains the need to have as small as possible identifiers
for nodes. Two years back I'd suggested that we rely on the user to
manually assign 16bit ids to individual nodes. Not very surprisingly
that was shot down because a) 16bit numbers are not descriptive b) a
per node identifier is problematic because it prohibits replication
inside the same cluster.So, what I've proposed since is to have two different forms of
identifiers. A long one, that's as descriptive as
$replication_solution wants. And a small one (16bit in my case) that's
*only meaningful within one node*. The long, user facing, identifier
is locally mapped to the short one.In the first version I proposed these long identifiers had a fixed
form, including the system identifier, timeline id, database id, and a
freeform name. That wasn't well received and I agree that that's too
narrow. I think it should be a freeform text of arbitrary length.Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.I'm pretty fuzzy on how this actually works. Like, the short form
here is just getting injected into WAL by the apply process. How does
it figure out what value to inject?
Thy apply process once does
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
that looks up the internal identifier and stores it in a global
variable. That's then filled in struct XLogRecord.
To setup the origin LSN of a transaction
SELECT pg_replication_identifier_setup_tx_origin('0/123456', '2013-12-11 15:14:59.219737+01')
is used. If setup that'll emit the 'extended' commit record with the
remote commit LSN.
What if it injects a value that doesn't have a short-to-long mapping?
Shouldn't be possible unless you drop a replication identifier after it
has been setup by *_replaying_from(). If we feel that's a actually
dangerous scenario we can prohibit it with a session level lock.
What's the point of the short-to-long mappings in the first place? Is
that only required because of the possibility that there might be
multiple replication solutions in play on the same node?
In my original proposal, 2 years+ back, I only used short numeric
ids. And people didn't like it because it requires coordination between
the replication solutions and possibly between servers. Using a string
identifier like in the above allows to easily build unique names; and
allows every solution to add the information it needs into replication
identifiers.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 26/09/14 04:44, Robert Haas wrote:
On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.I'm pretty fuzzy on how this actually works. Like, the short form
here is just getting injected into WAL by the apply process. How does
it figure out what value to inject? What if it injects a value that
doesn't have a short-to-long mapping? What's the point of the
short-to-long mappings in the first place? Is that only required
because of the possibility that there might be multiple replication
solutions in play on the same node?
From my perspective the short-to-long mapping is mainly convenience
thing, long id should be something that can be used to map the
identifier to the specific node for the purposes of configuration,
monitoring, troubleshooting, etc. You also usually don't use just Oids
to represent the DB objects, I see some analogy there.
This could be potentially done by the solution itself, not by the
framework, but it seems logical (pardon the pun) that most (if not all)
solutions will want some kind of mapping of the generated ids to
something that represents the logical node.
So answer to your first two questions depends on the specific solution,
it can map it from connection configuration, it can get the it from the
output plugin as part of wire protocol, it can generate it based on some
internal logic, etc.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Let me try to summarize the information requirements for each of these
things. For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message.I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.
Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.
Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.I think this is a lot more problematic. I agree that having the data
in the commit record isn't sufficient here, because for filtering
purposes (for example) you really want to identify the problematic
transactions at the beginning, so you can chuck their WAL, rather than
reassembling the transaction first and then throwing it out. But
putting the origin ID in every insert/update/delete is pretty
unappealing from my point of view - not just because it adds bytes to
WAL, though that's a non-trivial concern, but also because it adds
complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
happier with a solution where, say, we have a separate WAL record that
says "XID 1234 will be decoding for origin 567 until further notice".I think it actually ends up much simpler than what you propose. In the
apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
or it's C equivalent. That sets a global variable which XLogInsert()
includes the record.
Note that this doesn't actually require any additional space in the WAL
- padding bytes in struct XLogRecord are used to store the
identifier. These have been unused at least since 8.0.
Sure, that's simpler for logical decoding, for sure. That doesn't
make it the right decision for the system overall.
I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.
But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care. You know me: I've
been a huge advocate of logical decoding. But just like row-level
security or BRIN indexes or any other feature, I think it needs to be
designed in a way that minimizes the impact it has on the rest of the
system. I simply don't believe your contention that this isn't adding
any complexity to the code path for regular DML operations. It's
entirely possible we could need bit space in those records in the
future for something that actually pertains to those operations; if
you've burned it for logical decoding, it'll be difficult to claw it
back. And what if Tom gets around, some day, to doing that pluggable
heap AM work? Then every heap AM has got to allow for those bits, and
maybe that doesn't happen to be free for them.
Admittedly, these are hypothetical scenarios, but I don't think
they're particularly far-fetched. And as a fringe benefit, if you do
it the way that I'm proposing, you can use an OID instead of a 16-bit
thing that we picked to be 16 bits because that happens to be 100% of
the available bit-space. Yeah, there's some complexity on decoding,
but it's minimal: one more piece of fixed-size state to track per XID.
That's trivial compared to what you've already got.
What's the point of the short-to-long mappings in the first place? Is
that only required because of the possibility that there might be
multiple replication solutions in play on the same node?In my original proposal, 2 years+ back, I only used short numeric
ids. And people didn't like it because it requires coordination between
the replication solutions and possibly between servers. Using a string
identifier like in the above allows to easily build unique names; and
allows every solution to add the information it needs into replication
identifiers.
I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis. I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running on the
same system. But I want to confirm if that's actually what you're
thinking.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Let me try to summarize the information requirements for each of these
things. For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message.I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.
Precisely.
I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care.
I agree that that's a worthy goal. But I don't see how this isn't the
case with what I propose? This isn't happening on the level of
individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
in struct XLogRecord for a long time.
There's also the significant advantage that not basing this on the xid
allows it to work correctly with records not tied to a
transaction. There's not that much of that happening yet, but I've
several features in mind:
* separately replicate 2PC commits. 2PC commits don't have an xid
anymore... With some tooling on top replication 2PC in two phases
allow for very cool stuff. Like optionally synchronous multimaster
replication.
* I have a pending patch that allows to send 'messages' through logical
decoding - yielding a really fast and persistent queue. For that it's
useful have transactional *and* nontransactional messages.
* Sanely replicating CONCURRENTLY stuff gets harder if you tie things to
the xid.
The absolutely, super, uber most convincing reason is:
It's trivial to build tools to analyze how much WAL traffic is generated
by which replication stream and how much by originates locally. A
pg_xlogdump --stats=replication_identifier wouldn't be hard ;)
You know me: I've
been a huge advocate of logical decoding. But just like row-level
security or BRIN indexes or any other feature, I think it needs to be
designed in a way that minimizes the impact it has on the rest of the
system.
Totally agreed. And that always will take some arguing...
I simply don't believe your contention that this isn't adding
any complexity to the code path for regular DML operations. It's
entirely possible we could need bit space in those records in the
future for something that actually pertains to those operations; if
you've burned it for logical decoding, it'll be difficult to claw it
back. And what if Tom gets around, some day, to doing that pluggable
heap AM work? Then every heap AM has got to allow for those bits, and
maybe that doesn't happen to be free for them.
As explained above this isn't happening on the level of individual AMs.
Admittedly, these are hypothetical scenarios, but I don't think
they're particularly far-fetched. And as a fringe benefit, if you do
it the way that I'm proposing, you can use an OID instead of a 16-bit
thing that we picked to be 16 bits because that happens to be 100% of
the available bit-space. Yeah, there's some complexity on decoding,
but it's minimal: one more piece of fixed-size state to track per XID.
That's trivial compared to what you've already got.
But it forces you to track the xids/transactions. With my proposal you
can ignore transaction *entirely* unless they manipulate the
catalog. For concurrent OLTP workloads that's quite the advantage.
What's the point of the short-to-long mappings in the first place? Is
that only required because of the possibility that there might be
multiple replication solutions in play on the same node?In my original proposal, 2 years+ back, I only used short numeric
ids. And people didn't like it because it requires coordination between
the replication solutions and possibly between servers. Using a string
identifier like in the above allows to easily build unique names; and
allows every solution to add the information it needs into replication
identifiers.I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis. I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running on the
same system. But I want to confirm if that's actually what you're
thinking.
Yes, that and that such a mapping needs to be done across all database
are the primary reasons. As it's currently impossible to create further
shared relations you'd have to invent something living in the data
directory on filesystem level... Brr.
I think it'd also be much worse for debugging if there'd be no way to
map such a internal identifier back to the replication solution in some
form.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
As explained above this isn't happening on the level of individual AMs.
Well, that's even worse. You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).
I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis. I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running on the
same system. But I want to confirm if that's actually what you're
thinking.Yes, that and that such a mapping needs to be done across all database
are the primary reasons. As it's currently impossible to create further
shared relations you'd have to invent something living in the data
directory on filesystem level... Brr.I think it'd also be much worse for debugging if there'd be no way to
map such a internal identifier back to the replication solution in some
form.
OK.
One question I have is what the structure of the names should be. It
seems some coordination could be needed here. I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
we should decide that names ought to be of the form
<replication-solution>.<further-period-separated-components> or
something like that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
As explained above this isn't happening on the level of individual AMs.
Well, that's even worse. You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).
I don't think that's a fair characterization. Right now it's available
to precisely nobody. You can't put any data in there in *any* way. It
just has been sitting around unused for at least 8 years.
One question I have is what the structure of the names should be. It
seems some coordination could be needed here. I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
we should decide that names ought to be of the form
<replication-solution>.<further-period-separated-components> or
something like that.
I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
As explained above this isn't happening on the level of individual AMs.
Well, that's even worse. You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).I don't think that's a fair characterization. Right now it's available
to precisely nobody. You can't put any data in there in *any* way. It
just has been sitting around unused for at least 8 years.
Huh? That's just to say that the unused bit space is, in fact,
unused. But so what? We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.
One question I have is what the structure of the names should be. It
seems some coordination could be needed here. I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
we should decide that names ought to be of the form
<replication-solution>.<further-period-separated-components> or
something like that.I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?
Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name. Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-26 11:02:16 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
As explained above this isn't happening on the level of individual AMs.
Well, that's even worse. You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).I don't think that's a fair characterization. Right now it's available
to precisely nobody. You can't put any data in there in *any* way. It
just has been sitting around unused for at least 8 years.Huh? That's just to say that the unused bit space is, in fact,
unused. But so what? We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.
I don't think that's a very meaningful comparison. The problem with
infomask bits is that it's impossible to change anything once added
because of pg_upgrade'ability. That problem does not exist for
XLogRecord. We've twiddled with the WAL format pretty much in every
release. We can reconsider every release.
I can't remember anyone but me thinking about using these two bytes. So
the comparison here really is using two free bytes vs. issuing at least
~30 (record + origin) for every replayed transaction. Don't think that's
a unfair tradeof.
One question I have is what the structure of the names should be. It
seems some coordination could be needed here. I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
we should decide that names ought to be of the form
<replication-solution>.<further-period-separated-components> or
something like that.I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name. Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.
Ah. I thought you only wanted to suggest a separator between the
replication solution and it's internal dat. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Huh? That's just to say that the unused bit space is, in fact,
unused. But so what? We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.I don't think that's a very meaningful comparison. The problem with
infomask bits is that it's impossible to change anything once added
because of pg_upgrade'ability. That problem does not exist for
XLogRecord. We've twiddled with the WAL format pretty much in every
release. We can reconsider every release.I can't remember anyone but me thinking about using these two bytes. So
the comparison here really is using two free bytes vs. issuing at least
~30 (record + origin) for every replayed transaction. Don't think that's
a unfair tradeof.
Mmph. You have a point about the WAL format being easier to change.
"Reconsidering", though, would mean that some developer who probably
isn't you needs those bytes for something that really is a more
general need than this, so they write a patch to get them back by
doing what I proposed - and then it gets rejected because it's not as
good for logical replication. So I'm not sure I really buy this as an
argument. For all practical purposes, if you grab them, they'll be
gone.
I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name. Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.Ah. I thought you only wanted to suggest a separator between the
replication solution and it's internal dat. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.
Agreed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Huh? That's just to say that the unused bit space is, in fact,
unused. But so what? We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.I don't think that's a very meaningful comparison. The problem with
infomask bits is that it's impossible to change anything once added
because of pg_upgrade'ability. That problem does not exist for
XLogRecord. We've twiddled with the WAL format pretty much in every
release. We can reconsider every release.I can't remember anyone but me thinking about using these two bytes. So
the comparison here really is using two free bytes vs. issuing at least
~30 (record + origin) for every replayed transaction. Don't think that's
a unfair tradeof.Mmph. You have a point about the WAL format being easier to change.
"Reconsidering", though, would mean that some developer who probably
isn't you needs those bytes for something that really is a more
general need than this, so they write a patch to get them back by
doing what I proposed - and then it gets rejected because it's not as
good for logical replication. So I'm not sure I really buy this as an
argument. For all practical purposes, if you grab them, they'll be
gone.
Sure, it'll possibly not be trivial to move them elsewhere. On the other
hand, the padding bytes have been unused for 8+ years without somebody
laying "claim" on them but "me". I don't think it's a good idea to leave
them there unused when nobody even has proposed another use for a long
while. That'll just end up with them continuing to be unused. And
there's actually four more consecutive bytes on 64bit systems that are
unused.
Should there really be a dire need after that, we can simply bump the
record size. WAL volume wise it'd not be too bad to make the record a
tiny bit bigger - the header is only a relatively small fraction of the
entire content.
I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name. Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.Ah. I thought you only wanted to suggest a separator between the
replication solution and it's internal dat. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.Agreed.
So, let's recommend underscores as that separator?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/26/2014 06:05 PM, Andres Freund wrote:
On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
Sure, it'll possibly not be trivial to move them elsewhere. On the other
hand, the padding bytes have been unused for 8+ years without somebody
laying "claim" on them but "me". I don't think it's a good idea to leave
them there unused when nobody even has proposed another use for a long
while. That'll just end up with them continuing to be unused. And
there's actually four more consecutive bytes on 64bit systems that are
unused.Should there really be a dire need after that, we can simply bump the
record size. WAL volume wise it'd not be too bad to make the record a
tiny bit bigger - the header is only a relatively small fraction of the
entire content.
If we were now increasing the WAL record size anyway for some unrelated
reason, would we be willing to increase it by a further 2 bytes for the
node identifier?
If the answer is 'no' then I don't think we can justify using the 2
padding bytes just because they are there and have been unused for
years. But if the answer is yes, we feel this important enough to
justfiy a slightly (2 byte) larger WAL record header then we shouldn't
use the excuse of maybe needing those 2 bytes for something else. When
something else comes along that needs the WAL space we'll have to
increase the record size.
To say that if some other patch comes along that needs the space we'll
redo this feature to use the method Robert describes is unrealistic. If
we think that the replication identifier isn't
general/important/necessary to justify 2 bytes of WAL header space then
we should start out with something that doesn't use the WAL header,
Steve
I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name. Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.Ah. I thought you only wanted to suggest a separator between the
replication solution and it's internal dat. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.Agreed.
So, let's recommend underscores as that separator?
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/26/2014 10:21 AM, Andres Freund wrote:
On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Let me try to summarize the information requirements for each of these
things. For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message.I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.Precisely.
I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care.I agree that that's a worthy goal. But I don't see how this isn't the
case with what I propose? This isn't happening on the level of
individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
in struct XLogRecord for a long time.There's also the significant advantage that not basing this on the xid
allows it to work correctly with records not tied to a
transaction. There's not that much of that happening yet, but I've
several features in mind:* separately replicate 2PC commits. 2PC commits don't have an xid
anymore... With some tooling on top replication 2PC in two phases
allow for very cool stuff. Like optionally synchronous multimaster
replication.
* I have a pending patch that allows to send 'messages' through logical
decoding - yielding a really fast and persistent queue. For that it's
useful have transactional *and* nontransactional messages.
* Sanely replicating CONCURRENTLY stuff gets harder if you tie things to
the xid.
At what point in the decoding stream should something related to a
CONCURRENTLY command show up?
Also, for a logical message queue why couldn't you have a xid associated
with the message that had nothing else in the transaction?
l
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-27 12:11:16 -0400, Steve Singer wrote:
On 09/26/2014 06:05 PM, Andres Freund wrote:
On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
Sure, it'll possibly not be trivial to move them elsewhere. On the other
hand, the padding bytes have been unused for 8+ years without somebody
laying "claim" on them but "me". I don't think it's a good idea to leave
them there unused when nobody even has proposed another use for a long
while. That'll just end up with them continuing to be unused. And
there's actually four more consecutive bytes on 64bit systems that are
unused.Should there really be a dire need after that, we can simply bump the
record size. WAL volume wise it'd not be too bad to make the record a
tiny bit bigger - the header is only a relatively small fraction of the
entire content.If we were now increasing the WAL record size anyway for some unrelated
reason, would we be willing to increase it by a further 2 bytes for the node
identifier?
If the answer is 'no' then I don't think we can justify using the 2 padding
bytes just because they are there and have been unused for years. But if
the answer is yes, we feel this important enough to justfiy a slightly (2
byte) larger WAL record header then we shouldn't use the excuse of maybe
needing those 2 bytes for something else. When something else comes along
that needs the WAL space we'll have to increase the record size.
I don't think that's a good way to see this. By that argument these
bytes will never be used.
Also there's four more free bytes on 64bit systems...
To say that if some other patch comes along that needs the space we'll redo
this feature to use the method Robert describes is unrealistic. If we think
that the replication identifier isn't general/important/necessary to
justify 2 bytes of WAL header space then we should start out with something
that doesn't use the WAL header,
Maintaining complexity also has its costs. And I think that's much more
concrete than some imaginary feature (of which nothing was heard for the
last 8+ years) also needing two bytes.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Sep 27, 2014 at 12:11 PM, Steve Singer <steve@ssinger.info> wrote:
If we were now increasing the WAL record size anyway for some unrelated
reason, would we be willing to increase it by a further 2 bytes for the node
identifier?
Obviously not. Otherwise Andres would be proposing to put an OID in
there instead of a kooky 16-bit identifier.
If the answer is 'no' then I don't think we can justify using the 2 padding
bytes just because they are there and have been unused for years. But if
the answer is yes, we feel this important enough to justfiy a slightly (2
byte) larger WAL record header then we shouldn't use the excuse of maybe
needing those 2 bytes for something else. When something else comes along
that needs the WAL space we'll have to increase the record size.To say that if some other patch comes along that needs the space we'll redo
this feature to use the method Robert describes is unrealistic. If we think
that the replication identifier isn't general/important/necessary to
justify 2 bytes of WAL header space then we should start out with something
that doesn't use the WAL header,
I lean in that direction too, but would welcome more input from others.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/23/2014 09:24 PM, Andres Freund wrote:
I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
.The've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de
And even earlier here:
/messages/by-id/1339586927-13156-10-git-send-email-andres@2ndquadrant.com
The thread branched a lot, the relevant branch is the one with subject
"[PATCH 10/16] Introduce the concept that wal has a 'origin' node"
== Identify the origin of changes ==
Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.
An origin column in the table itself helps tremendously to debug issues
with the replication system. In many if not most scenarios, I think
you'd want to have that extra column, even if it's not strictly required.
What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.
Adding a field to XLogRecord for this feels wrong. This is for *logical*
replication - why do you need to mess with something as physical as the
WAL record format?
And who's to say that a node ID is the most useful piece of information
for a replication system to add to the WAL header. I can easily imagine
that you'd want to put a changeset ID or something else in there,
instead. (I mentioned another example of this in
/messages/by-id/4FE17043.60403@enterprisedb.com)
If we need additional information added to WAL records, for extensions,
then that should be made in an extensible fashion. IIRC (I couldn't find
a link right now), when we discussed the changes to heap_insert et al
for wal_level=logical, I already argued back then that we should make it
possible for extensions to annotate WAL records, with things like "this
is the primary key", or whatever information is needed for conflict
resolution, or handling loops. I don't like it that we're adding little
pieces of information to the WAL format, bit by bit.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-10-02 11:49:31 +0300, Heikki Linnakangas wrote:
On 09/23/2014 09:24 PM, Andres Freund wrote:
I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
.The've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.deAnd even earlier here:
/messages/by-id/1339586927-13156-10-git-send-email-andres@2ndquadrant.com
The thread branched a lot, the relevant branch is the one with subject
"[PATCH 10/16] Introduce the concept that wal has a 'origin' node"
Right. Long time ago already ;)
== Identify the origin of changes ==
Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.An origin column in the table itself helps tremendously to debug issues with
the replication system. In many if not most scenarios, I think you'd want to
have that extra column, even if it's not strictly required.
I don't think you'll have much success convincing actual customers of
that. It's one thing to increase the size of the WAL stream a bit, it's
entirely different to persistently increase the table size of all their
tables.
What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.Adding a field to XLogRecord for this feels wrong. This is for *logical*
replication - why do you need to mess with something as physical as the WAL
record format?
XLogRecord isn't all that "physical". It doesn't encode anything in that
regard but the fact that there's backup blocks in the record. It's
essentially just an implementation detail of logging. Whether that's
physical or logical doesn't really matter much.
There's basically two primary reasons I think it's a good idea to add it
there:
a) There's many different type of records where it's useful to add the
origin. Adding the information to all these will make things more
complicated, using more space, and be more fragile. And I'm pretty
sure that the number of things people will want to expose over
logical replication will increase.
I know of at least two things that have at least some working code:
Exposing 2PC to logical decoding to allow optionally synchronous
replication, and allowing to send transactional/nontransactional
'messages' via the WAL without writing to a table.
Now, we could add a framework to attach general information to every
record - but I have a very hard time seing how this will be of
comparable complexity *and* efficiency.
b) It's dead simple with a pretty darn low cost. Both from a runtime as
well as a maintenance perspective.
c) There needs to be crash recovery interation anyway to compute the
state of how far replication succeeded before crashing. So it's not
like we could make this completely extensible without core code
knowing.
And who's to say that a node ID is the most useful piece of information for
a replication system to add to the WAL header. I can easily imagine that
you'd want to put a changeset ID or something else in there, instead. (I
mentioned another example of this in
/messages/by-id/4FE17043.60403@enterprisedb.com)
I'm onboard with adding a extensible facility to attach data to
successful transactions. There've been at least two people asking me
directly about how to e.g. attach user information to transactions.
I don't think that's equivalent with what I'm talking about here
though. One important thing about this proposal is that it allows to
completely skip (nearly, except cache inval) all records with a
uninteresting origin id *before* decoding them. Without having to keep
any per transaction state about 'uninteresting' transactions.
If we need additional information added to WAL records, for extensions, then
that should be made in an extensible fashion
I can see how we'd do that for individual records (e.g. the various
commit records, after combining them), but i have a hard time seing the
cost of doing that for all records worth it. Especially as it seems
likely to require significant increases in wal volume?
IIRC (I couldn't find a link
right now), when we discussed the changes to heap_insert et al for
wal_level=logical, I already argued back then that we should make it
possible for extensions to annotate WAL records, with things like "this is
the primary key", or whatever information is needed for conflict resolution,
or handling loops. I don't like it that we're adding little pieces of
information to the WAL format, bit by bit.
I don't think this is "adding little pieces of information to the WAL
format, bit by bit.". It's a relatively central piece for allowing
efficient and maintainable logical replication.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
An origin column in the table itself helps tremendously to debug issues with
the replication system. In many if not most scenarios, I think you'd want to
have that extra column, even if it's not strictly required.
I like a lot of what you wrote here, but I strongly disagree with this
part. A good replication solution shouldn't require changes to the
objects being replicated. The triggers that Slony and other current
logical replication solutions use are a problem not only because
they're slow (although that is a problem) but also because they
represent a user-visible wart that people who don't otherwise care
about the fact that their database is being replicated have to be
concerned with. I would agree that some people might, for particular
use cases, want to include origin information in the table that the
replication system knows about, but it shouldn't be required.
When you look at the replication systems that we have today, you've
basically got streaming replication, which is high-performance and
fairly hands-off (at least once you get it set up properly; that part
can be kind of a bear) but can't cross versions let alone database
systems and requires that the slaves be strictly read-only. Then on
the flip side you've got things like Slony, Bucardo, and others. Some
of these allow multi-master; all of them at least allow table-level
determination of which server has the writable copy. Nearly all of
them are cross-version and some even allow replication into
non-PostgreSQL systems. But they are slow *and administratively
complex*. If we're able to get something that feels like streaming
replication from a performance and administrative complexity point of
view but can be cross-version and allow at least some writes on
slaves, that's going to be an epic leap forward for the project.
In my mind, that means it's got to be completely hands-off from a
schema design point of view: you should be able to start up a database
and design it however you want, put anything you like into it, and
then decide later that you want to bolt logical replication on top of
it, just as you can for streaming physical replication.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/2/14, 7:28 AM, Robert Haas wrote:
On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:An origin column in the table itself helps tremendously to debug issues with
the replication system. In many if not most scenarios, I think you'd want to
have that extra column, even if it's not strictly required.I like a lot of what you wrote here, but I strongly disagree with this
part. A good replication solution shouldn't require changes to the
objects being replicated.
I agree that asking users to modify objects is bad, but I also think that if you do have records coming into one table from multiple sources then you will need to know what system they originated on.
Maybe some sort of "hidden" column would work here? That means users don't need to modify anything (including anything doing SELECT *), but the data is there.
As for space concerns I think the answer there is to somehow normalize the identifiers themselves. That has the added benefit of allowing a rename of a source to propagate to all the data already replicated from that source.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2 October 2014 09:49, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.Adding a field to XLogRecord for this feels wrong. This is for *logical*
replication - why do you need to mess with something as physical as the WAL
record format?
Some reasons why this would be OK:
1. We've long agreed that adding information to the WAL record is OK,
just as long as it doesn't interfere (much) with the primary purpose.
It seems OK to change wal_level so it is a list of additional info,
e.g. wal_level = 'hot standby, logical, originid', or just wal_level =
'$new_level_name' that includes originid info
2. We seem to have agreed elsewhere that extensions to WAL will be
allowed. So perhaps redefining those bytes is something that can be
re-used. That way we don't all have to agree what we'll use them for.
Just call a user defined function that returns a 2 byte integer, or
zero if no plugin.
3. As for how many bytes there are - I count 6 wasted bytes on a
MAXALIGN=8 record. Plus we discussed long ago ways we can save another
4 bytes on records that don't have a backup block, since in that case
xl_tot_len == xl_len. I've also got a feeling that WAL records that
are 2 billion bytes long might be overkill. I figure we could easily
make a length limit of something less than that - only commit records
can be longer than 2^19 bytes when they have >65536 subxids, which is
hardly common. Plus xl_prev is always at least 4 byte aligned, so
there are at least 3 bits to be reclaimed from there. (Plus we have 7
unused RmgrId bits, though maybe we want to keep those)
So I count about 14 bytes of potentially reclaimable space in the WAL
record header, or 10 with backup blocks. The reason we never reclaimed
it before was that benchmarks previously showed that reducing the
volume of WAL had little effect on performance, we weren't looking
specifically at WAL volume or info content. (And the perf result is
probably different now anyway).
Should we grovel around to reclaim any of that? Not necessary, the
next person with a good reason to use some space can do that.
Pick any of those: I see no reason to prevent reusing 2 bytes for a
useful purpose, when we do all agree it is a useful purpose.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
Here's my next attept attempt at producing something we can agree
upon.
The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.
Other changes:
* Locking preventing several backends to replay changes at the same
time. This is actually overly restrictive in some cases, but I think
good enough for now.
* Logical decoding grew a filter_by_origin callback that allows to
ignore changes that were replayed on a remote system. Such filters are
executed before much is done with records, potentially saving a fair
bit of costs.
* Rebased. That took a bit due the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication
identifier).
I also want to quickly recap replication identifiers, given that
in-person conversations with several people proved that the concept was
slightly misunderstood:
Think about a logical replication solution trying to replay changes. The
postmaster in which the data is replayed into crashes every now and
then. Replication identifiers allow you to do something like:
do_replication()
{
source = ConnectToSourceSystem('mysource');
target = ConnectToSourceSystem('target');
# mark we're replayin
target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);
# get how far we've replayed last time round
remote_lsn = target.exec($$SELECT remote_lsn FROM pg_get_replication_identifier_progress WHERE external_id = 'myrep_mysource');
# and now replay changes
copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn);
while (record = copystream.get_record())
{
if (record.type = 'begin')
{
target.exec('BEGIN');
# setup the position of this individual xact
target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, $2);',
record.origin_lsn, record.origin_commit_timestamp);
}
else if (record.type = 'change')
target.exec(record.change_sql)
else if (record.type = 'commit')
target.exec('COMMIT');
}
}
A non pseudocode version of the above would be safe against crashes of
both the source and the target system. If the target system crashes the
replication identifier logic will recover how far we replayed during
crash recovery. If the source system crashes/disconnects we'll have the
current value in memory. Note that this works perfectly well if the
target system (and obviously the source system, but that's obvious) use
synchronous_commit = off - we'll not miss any changes.
Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.
The SQL interface consists out of:
# manage existance of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);
# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);
# replication progress status view
SELECT * FROM pgreplication_identifier_progress;
# replicatation identifiers
SELECT * FROM pg_replication_identifier;
Petr has developed (for UDR, i.e. logical replication ontop of 9.4) a
SQL reimplementation of replication identifiers and that has proven that
for busier workloads doing a table update to store the replication
progress indeed has a noticeable overhead. Especially if there's some
longer running activity on the standby.
The bigger questions I have are:
1) Where to store the origin. I personally still think that using the
padding is fine. Now that I have proven that it's pretty simple to
store additional information the argument that it might be needed for
something else doesn't really hold anymore. But I can live with the
other solution as well - 3 bytes additional overhead ain't so bad.
2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do
we want to store the origin only on relevant records? That'd be
XLOG_HEAP_INSERT/XLOG_HEAPMULTI_INSERT/XLOG_HEAP_UPDATE //
XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of something
like XLogLogOriginIfAvailable() before the emitting log
XLogInsert()s.
3) There should be a lwlock for the individual replication identifier
progress slots.
4) Right now identifier progress is stored during checkpoints in special
files - maybe it'd be better to store them inside the checkpoint
record somehow. We read that even after a clean shutdown, so that
should be fine.
5) I'm think there are issues with a streaming replication standby if
many identifiers are created/dropped. Those shouldn't be too hard to
fix.
6) Obviously the hack in bootstrap.c to get riname marked NOT NULL isn't
acceptable. Either I need to implement boostrap support for marking
varlenas NOT NULL as discussed nearby or replace the syscache lookup
with a index lookup.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
Here's my next attept attempt at producing something we can agree
upon.The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.Other changes:
* Locking preventing several backends to replay changes at the same
time. This is actually overly restrictive in some cases, but I think
good enough for now.
* Logical decoding grew a filter_by_origin callback that allows to
ignore changes that were replayed on a remote system. Such filters are
executed before much is done with records, potentially saving a fair
bit of costs.
* Rebased. That took a bit due the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication
identifier).
And here an actual patch.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Introduce-replication-identifiers-to-keep-track-of-r.patchtext/x-patch; charset=us-asciiDownload
>From 268d52cac6bf7fe1c019fd68248853c7c7ae18b1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Feb 2015 01:22:08 +0100
Subject: [PATCH] Introduce replication identifiers to keep track of
replication progress: v0.6
---
contrib/test_decoding/Makefile | 3 +-
contrib/test_decoding/expected/replident.out | 84 ++
contrib/test_decoding/sql/replident.sql | 40 +
contrib/test_decoding/test_decoding.c | 28 +
src/backend/access/rmgrdesc/xactdesc.c | 17 +-
src/backend/access/transam/xact.c | 64 +-
src/backend/access/transam/xlog.c | 34 +-
src/backend/access/transam/xloginsert.c | 22 +-
src/backend/access/transam/xlogreader.c | 10 +
src/backend/bootstrap/bootstrap.c | 5 +-
src/backend/catalog/Makefile | 2 +-
src/backend/catalog/catalog.c | 8 +-
src/backend/catalog/system_views.sql | 7 +
src/backend/replication/logical/Makefile | 3 +-
src/backend/replication/logical/decode.c | 63 +-
src/backend/replication/logical/logical.c | 5 +
src/backend/replication/logical/reorderbuffer.c | 5 +-
.../replication/logical/replication_identifier.c | 1190 ++++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/cache/syscache.c | 23 +
src/backend/utils/misc/guc.c | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/pg_resetxlog/pg_resetxlog.c | 3 +
src/include/access/xact.h | 10 +-
src/include/access/xlog.h | 1 +
src/include/access/xlogdefs.h | 6 +
src/include/access/xlogreader.h | 9 +
src/include/access/xlogrecord.h | 5 +-
src/include/catalog/indexing.h | 6 +
src/include/catalog/pg_proc.h | 28 +
src/include/catalog/pg_replication_identifier.h | 75 ++
src/include/pg_config_manual.h | 6 +
src/include/replication/output_plugin.h | 8 +
src/include/replication/reorderbuffer.h | 8 +-
src/include/replication/replication_identifier.h | 58 +
src/include/utils/syscache.h | 2 +
src/test/regress/expected/rules.out | 5 +
src/test/regress/expected/sanity_check.out | 1 +
38 files changed, 1816 insertions(+), 33 deletions(-)
create mode 100644 contrib/test_decoding/expected/replident.out
create mode 100644 contrib/test_decoding/sql/replident.sql
create mode 100644 src/backend/replication/logical/replication_identifier.c
create mode 100644 src/include/catalog/pg_replication_identifier.h
create mode 100644 src/include/replication/replication_identifier.h
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 438be44..f8334cc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -37,7 +37,8 @@ submake-isolation:
submake-test_decoding:
$(MAKE) -C $(top_builddir)/contrib/test_decoding
-REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel binary prepared
+REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \
+ binary prepared replident
regresscheck: all | submake-regress submake-test_decoding
$(MKDIR_P) regression_output
diff --git a/contrib/test_decoding/expected/replident.out b/contrib/test_decoding/expected/replident.out
new file mode 100644
index 0000000..1c508a5
--- /dev/null
+++ b/contrib/test_decoding/expected/replident.out
@@ -0,0 +1,84 @@
+-- predictability
+SET synchronous_commit = on;
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+ pg_replication_identifier_create
+----------------------------------
+ 1
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ BEGIN
+ table public.target_tbl: INSERT: id[integer]:1 data[text]:'BEGIN'
+ table public.target_tbl: INSERT: id[integer]:2 data[text]:'table public.origin_tbl: INSERT: id[integer]:1 data[text]:''will be replicated and decoded and decoded again'''
+ table public.target_tbl: INSERT: id[integer]:3 data[text]:'COMMIT'
+ COMMIT
+(5 rows)
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+-- mark session as replaying
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+ pg_replication_identifier_setup_replaying_from
+------------------------------------------------
+
+(1 row)
+
+BEGIN;
+-- setup transaction origins
+SELECT pg_replication_identifier_setup_tx_origin('0/ffffffff', '2013-01-01 00:00');
+ pg_replication_identifier_setup_tx_origin
+-------------------------------------------
+
+(1 row)
+
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+SELECT pg_replication_identifier_reset_replaying_from();
+ pg_replication_identifier_reset_replaying_from
+------------------------------------------------
+
+(1 row)
+
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+------
+(0 rows)
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+--------------------------------------------------------------------------------
+ BEGIN
+ table public.origin_tbl: INSERT: id[integer]:3 data[text]:'will be replicated'
+ COMMIT
+(3 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: regression_slot');
+ pg_replication_identifier_drop
+--------------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/replident.sql b/contrib/test_decoding/sql/replident.sql
new file mode 100644
index 0000000..f01836f
--- /dev/null
+++ b/contrib/test_decoding/sql/replident.sql
@@ -0,0 +1,40 @@
+-- predictability
+SET synchronous_commit = on;
+
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+
+-- mark session as replaying
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+
+BEGIN;
+-- setup transaction origins
+SELECT pg_replication_identifier_setup_tx_origin('0/ffffffff', '2013-01-01 00:00');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+
+SELECT pg_replication_identifier_reset_replaying_from();
+
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_replication_identifier_drop('test_decoding: regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 963d5df..2ec3001 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -21,6 +21,7 @@
#include "replication/output_plugin.h"
#include "replication/logical.h"
+#include "replication/replication_identifier.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -43,6 +44,7 @@ typedef struct
bool include_timestamp;
bool skip_empty_xacts;
bool xact_wrote_changes;
+ bool only_local;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +61,8 @@ static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
+static bool pg_decode_filter(LogicalDecodingContext *ctx,
+ RepNodeId origin_id);
void
_PG_init(void)
@@ -76,6 +80,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
}
@@ -97,6 +102,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_xids = true;
data->include_timestamp = false;
data->skip_empty_xacts = false;
+ data->only_local = false;
ctx->output_plugin_private = data;
@@ -155,6 +161,17 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "only-local") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->only_local = true;
+ else if (!parse_bool(strVal(elem->arg), &data->only_local))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -223,6 +240,17 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+static bool
+pg_decode_filter(LogicalDecodingContext *ctx,
+ RepNodeId origin_id)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->only_local && origin_id != InvalidRepNodeId)
+ return true;
+ return false;
+}
+
/*
* Print literal `outputstr' already represented as string of type `typid'
* into stringbuf `s'.
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3e87978..0ec6b0f 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -25,9 +25,12 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
{
int i;
TransactionId *subxacts;
+ SharedInvalidationMessage *msgs;
subxacts = (TransactionId *) &xlrec->xnodes[xlrec->nrels];
+ msgs = (SharedInvalidationMessage *) &subxacts[xlrec->nsubxacts];
+
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
if (xlrec->nrels > 0)
@@ -49,9 +52,6 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
}
if (xlrec->nmsgs > 0)
{
- SharedInvalidationMessage *msgs;
-
- msgs = (SharedInvalidationMessage *) &subxacts[xlrec->nsubxacts];
if (XactCompletionRelcacheInitFileInval(xlrec->xinfo))
appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
@@ -80,6 +80,17 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
appendStringInfo(buf, " unknown id %d", msg->id);
}
}
+ if (xlrec->xinfo & XACT_CONTAINS_ORIGIN)
+ {
+ xl_xact_origin *origin = (xl_xact_origin *) &(msgs[xlrec->nmsgs]);
+
+ appendStringInfo(buf, " origin %u, lsn %X/%X, at %s",
+ origin->origin_node_id,
+ (uint32)(origin->origin_lsn >> 32),
+ (uint32)origin->origin_lsn,
+ timestamptz_to_str(origin->origin_timestamp));
+ }
+
}
static void
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 97000ef..579f9cc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -40,8 +40,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/logical.h"
#include "replication/walsender.h"
#include "replication/syncrep.h"
+#include "replication/replication_identifier.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -1080,9 +1082,10 @@ RecordTransactionCommit(void)
* gracefully. Till then, it's just 20 bytes of overhead.
*/
if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval || forceSyncCommit ||
- XLogLogicalInfoActive())
+ XLogLogicalInfoActive() || replication_origin_id != InvalidRepNodeId)
{
xl_xact_commit xlrec;
+ xl_xact_origin origin;
/*
* Set flags required for recovery processing of commits.
@@ -1115,6 +1118,19 @@ RecordTransactionCommit(void)
if (nmsgs > 0)
XLogRegisterData((char *) invalMessages,
nmsgs * sizeof(SharedInvalidationMessage));
+ /* dump transaction origin information */
+ if (replication_origin_id != InvalidRepNodeId)
+ {
+ xlrec.xinfo |= XACT_CONTAINS_ORIGIN;
+
+ origin.origin_node_id = replication_origin_id;
+ origin.origin_lsn = replication_origin_lsn;
+ origin.origin_timestamp = replication_origin_timestamp;
+
+ XLogRegisterData((char *) &origin,
+ sizeof(xl_xact_origin));
+
+ }
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT);
}
else
@@ -1135,6 +1151,13 @@ RecordTransactionCommit(void)
}
}
+ /* record plain commit ts if not replaying remote actions */
+ if (replication_origin_id == InvalidRepNodeId ||
+ replication_origin_id == DoNotReplicateRepNodeId)
+ replication_origin_timestamp = xactStopTimestamp;
+ else
+ AdvanceCachedReplicationIdentifier(replication_origin_lsn, XactLastRecEnd);
+
/*
* We only need to log the commit timestamp separately if the node
* identifier is a valid value; the commit record above already contains
@@ -1146,7 +1169,7 @@ RecordTransactionCommit(void)
node_id = CommitTsGetDefaultNodeId();
TransactionTreeSetCommitTsData(xid, nchildren, children,
- xactStopTimestamp,
+ replication_origin_timestamp,
node_id, node_id != InvalidCommitTsNodeId);
}
@@ -1230,9 +1253,11 @@ RecordTransactionCommit(void)
if (wrote_xlog)
SyncRepWaitForLSN(XactLastRecEnd);
+ /* remember end of last commit record */
+ XactLastCommitEnd = XactLastRecEnd;
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
-
cleanup:
/* Clean up local data */
if (rels)
@@ -4665,10 +4690,12 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
SharedInvalidationMessage *inval_msgs, int nmsgs,
RelFileNode *xnodes, int nrels,
Oid dbId, Oid tsId,
- uint32 xinfo)
+ uint32 xinfo,
+ xl_xact_origin *origin)
{
TransactionId max_xid;
int i;
+ RepNodeId origin_node_id = InvalidRepNodeId;
max_xid = TransactionIdLatest(xid, nsubxacts, sub_xids);
@@ -4688,9 +4715,18 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
LWLockRelease(XidGenLock);
}
+ Assert(!!(xinfo & XACT_CONTAINS_ORIGIN) == (origin != NULL));
+
+ if (xinfo & XACT_CONTAINS_ORIGIN)
+ {
+ origin_node_id = origin->origin_node_id;
+ commit_time = origin->origin_timestamp;
+ }
+
/* Set the transaction commit timestamp and metadata */
TransactionTreeSetCommitTsData(xid, nsubxacts, sub_xids,
- commit_time, InvalidCommitTsNodeId, false);
+ commit_time, origin_node_id, false);
+
if (standbyState == STANDBY_DISABLED)
{
@@ -4747,6 +4783,14 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
StandbyReleaseLockTree(xid, 0, NULL);
}
+ if (xinfo & XACT_CONTAINS_ORIGIN)
+ {
+ /* recover apply progress */
+ AdvanceReplicationIdentifier(origin_node_id,
+ origin->origin_lsn,
+ lsn);
+ }
+
/* Make sure files supposed to be dropped are dropped */
if (nrels > 0)
{
@@ -4805,19 +4849,24 @@ xact_redo_commit(xl_xact_commit *xlrec,
{
TransactionId *subxacts;
SharedInvalidationMessage *inval_msgs;
+ xl_xact_origin *origin = NULL;
/* subxid array follows relfilenodes */
subxacts = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
/* invalidation messages array follows subxids */
inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
+ if (xlrec->xinfo & XACT_CONTAINS_ORIGIN)
+ origin = (xl_xact_origin *) &(inval_msgs[xlrec->nmsgs]);
+
xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
subxacts, xlrec->nsubxacts,
inval_msgs, xlrec->nmsgs,
xlrec->xnodes, xlrec->nrels,
xlrec->dbId,
xlrec->tsId,
- xlrec->xinfo);
+ xlrec->xinfo,
+ origin);
}
/*
@@ -4833,7 +4882,8 @@ xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
NULL, 0, /* relfilenodes */
InvalidOid, /* dbId */
InvalidOid, /* tsId */
- 0); /* xinfo */
+ 0, /* xinfo */
+ NULL /* origin */);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..5eb0ef5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "postmaster/startup.h"
#include "replication/logical.h"
#include "replication/slot.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
@@ -297,6 +298,8 @@ static XLogRecPtr ProcLastRecPtr = InvalidXLogRecPtr;
XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
+XLogRecPtr XactLastCommitEnd = InvalidXLogRecPtr;
+
/*
* RedoRecPtr is this backend's local copy of the REDO record pointer
* (which is almost but not quite the same as a pointer to the most recent
@@ -6014,6 +6017,11 @@ StartupXLOG(void)
StartupMultiXact();
/*
+ * Recover knowledge about replay progress of known replication partners.
+ */
+ StartupReplicationIdentifier(checkPoint.redo);
+
+ /*
* Initialize unlogged LSN. On a clean shutdown, it's restored from the
* control file. On recovery, all unlogged relations are blown away, so
* the unlogged LSN counter can be reset too.
@@ -7645,6 +7653,7 @@ CreateCheckPoint(int flags)
XLogRecPtr recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace;
+ XLogRecPtr oldRedoPtr;
XLogSegNo _logSegNo;
XLogRecPtr curInsert;
VirtualTransactionId *vxids;
@@ -7960,10 +7969,10 @@ CreateCheckPoint(int flags)
(errmsg("concurrent transaction log activity while database system is shutting down")));
/*
- * Select point at which we can truncate the log, which we base on the
- * prior checkpoint's earliest info.
+ * Select point at which we can truncate the log (and other resources
+ * related to it), which we base on the prior checkpoint's earliest info.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ oldRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update the control file.
@@ -8018,6 +8027,7 @@ CreateCheckPoint(int flags)
* Delete old log files (those no longer needed even for previous
* checkpoint or the standbys in XLOG streaming).
*/
+ XLByteToSeg(oldRedoPtr, _logSegNo);
if (_logSegNo)
{
KeepLogSeg(recptr, &_logSegNo);
@@ -8047,6 +8057,13 @@ CreateCheckPoint(int flags)
*/
TruncateMultiXact();
+ /*
+ * Remove old replication identifier checkpoints. We're using the previous
+ * checkpoint's redo ptr as a cutoff - even if we were to use that
+ * checkpoint to startup we're not going to need anything older.
+ */
+ TruncateReplicationIdentifier(oldRedoPtr);
+
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
@@ -8130,6 +8147,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointBuffers(flags); /* performs all required fsyncs */
+ CheckPointReplicationIdentifier(checkPointRedo);
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
@@ -8190,6 +8208,7 @@ CreateRestartPoint(int flags)
{
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
+ XLogRecPtr oldRedoPtr;
XLogSegNo _logSegNo;
TimestampTz xtime;
@@ -8289,7 +8308,7 @@ CreateRestartPoint(int flags)
* Select point at which we can truncate the xlog, which we base on the
* prior checkpoint's earliest info.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ oldRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update pg_control, using current time. Check that it still shows
@@ -8316,6 +8335,7 @@ CreateRestartPoint(int flags)
* checkpoint/restartpoint) to prevent the disk holding the xlog from
* growing full.
*/
+ XLByteToSeg(oldRedoPtr, _logSegNo);
if (_logSegNo)
{
XLogRecPtr receivePtr;
@@ -8385,6 +8405,12 @@ CreateRestartPoint(int flags)
TruncateMultiXact();
/*
+ * Also truncate replication identifiers. c.f. CreateCheckPoint()'s
+ * comment.
+ */
+ TruncateReplicationIdentifier(oldRedoPtr);
+
+ /*
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
* attempt to reference any pg_subtrans entry older than that (see Asserts
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a1e2eb8..a91298b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -25,6 +25,7 @@
#include "access/xloginsert.h"
#include "catalog/pg_control.h"
#include "miscadmin.h"
+#include "replication/replication_identifier.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
#include "utils/memutils.h"
@@ -76,10 +77,16 @@ static uint32 mainrdata_len; /* total # of bytes in chain */
static XLogRecData hdr_rdt;
static char *hdr_scratch = NULL;
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define SizeOfXlogOrigin 0
+#else
+#define SizeOfXlogOrigin (sizeof(RepNodeId) + sizeof(XLR_BLOCK_ID_ORIGIN))
+#endif
+
#define HEADER_SCRATCH_SIZE \
(SizeOfXLogRecord + \
MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
- SizeOfXLogRecordDataHeaderLong)
+ SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
/*
* An array of XLogRecData structs, to hold registered data.
@@ -629,6 +636,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
scratch += sizeof(BlockNumber);
}
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ /* followed by the record's origin, if any */
+ if (replication_origin_id != InvalidRepNodeId)
+ {
+ *(scratch++) = XLR_BLOCK_ID_ORIGIN;
+ memcpy(scratch, &replication_origin_id, sizeof(replication_origin_id));
+ scratch += sizeof(replication_origin_id);
+ }
+#endif
+
/* followed by main data, if any */
if (mainrdata_len > 0)
{
@@ -674,6 +691,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
rechdr->xl_tot_len = total_len;
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+ rechdr->xl_origin_id = replication_origin_id;
+#endif
rechdr->xl_prev = InvalidXLogRecPtr;
rechdr->xl_crc = rdata_crc;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 60470b5..f8233a0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -20,6 +20,7 @@
#include "access/xlog_internal.h"
#include "access/xlogreader.h"
#include "catalog/pg_control.h"
+#include "replication/replication_identifier.h"
static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -956,6 +957,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
ResetDecoder(state);
state->decoded_record = record;
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ state->record_origin = InvalidRepNodeId;
+#endif
ptr = (char *) record;
ptr += SizeOfXLogRecord;
@@ -990,6 +994,12 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
break; /* by convention, the main data fragment is
* always last */
}
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ else if (block_id == XLR_BLOCK_ID_ORIGIN)
+ {
+ COPY_HEADER_FIELD(&state->record_origin, sizeof(RepNodeId));
+ }
+#endif
else if (block_id <= XLR_MAX_BLOCK_ID)
{
/* XLogRecordBlockHeader */
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index bc66eac..e2de408 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -705,10 +705,13 @@ DefineAttr(char *name, char *type, int attnum)
* oidvector and int2vector are also treated as not-nullable, even though
* they are no longer fixed-width.
*/
+ /* FIXME!!!! */
#define MARKNOTNULL(att) \
((att)->attlen > 0 || \
(att)->atttypid == OIDVECTOROID || \
- (att)->atttypid == INT2VECTOROID)
+ (att)->atttypid == INT2VECTOROID || \
+ strcmp(NameStr((att)->attname), "riname") == 0 \
+ )
if (MARKNOTNULL(attrtypes[attnum]))
{
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index a403c64..5b04550 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -39,7 +39,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
pg_ts_config.h pg_ts_config_map.h pg_ts_dict.h \
pg_ts_parser.h pg_ts_template.h pg_extension.h \
pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
- pg_foreign_table.h pg_policy.h \
+ pg_foreign_table.h pg_policy.h pg_replication_identifier.h \
pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \
toasting.h indexing.h \
)
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 8e7a9ec..318d65a 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -32,6 +32,7 @@
#include "catalog/pg_namespace.h"
#include "catalog/pg_pltemplate.h"
#include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_replication_identifier.h"
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
@@ -224,7 +225,8 @@ IsSharedRelation(Oid relationId)
relationId == SharedDependRelationId ||
relationId == SharedSecLabelRelationId ||
relationId == TableSpaceRelationId ||
- relationId == DbRoleSettingRelationId)
+ relationId == DbRoleSettingRelationId ||
+ relationId == ReplicationIdentifierRelationId)
return true;
/* These are their indexes (see indexing.h) */
if (relationId == AuthIdRolnameIndexId ||
@@ -240,7 +242,9 @@ IsSharedRelation(Oid relationId)
relationId == SharedSecLabelObjectIndexId ||
relationId == TablespaceOidIndexId ||
relationId == TablespaceNameIndexId ||
- relationId == DbRoleSettingDatidRolidIndexId)
+ relationId == DbRoleSettingDatidRolidIndexId ||
+ relationId == ReplicationLocalIdentIndex ||
+ relationId == ReplicationExternalIdentIndex)
return true;
/* These are their toast tables and toast indexes (see toasting.h) */
if (relationId == PgShdescriptionToastTable ||
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5e69e2b..4559f99 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -769,6 +769,13 @@ CREATE VIEW pg_user_mappings AS
REVOKE ALL on pg_user_mapping FROM public;
+
+CREATE VIEW pg_replication_identifier_progress AS
+ SELECT *
+ FROM pg_get_replication_identifier_progress();
+
+REVOKE ALL ON pg_replication_identifier_progress FROM public;
+
--
-- We have a few function definitions in here, too.
-- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index 310a45c..95bcffb 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
-OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o snapbuild.o
+OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o replication_identifier.o \
+ snapbuild.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 77c02ba..f8f7016 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -40,6 +40,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "storage/standby.h"
@@ -67,7 +68,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
TransactionId xid, Oid dboid,
TimestampTz commit_time,
int nsubxacts, TransactionId *sub_xids,
- int ninval_msgs, SharedInvalidationMessage *msg);
+ int ninval_msgs, SharedInvalidationMessage *msg,
+ xl_xact_origin *origin);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecPtr lsn,
TransactionId xid, TransactionId *sub_xids, int nsubxacts);
@@ -201,16 +203,20 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_xact_commit *xlrec;
TransactionId *subxacts = NULL;
SharedInvalidationMessage *invals = NULL;
+ xl_xact_origin *origin = NULL;
xlrec = (xl_xact_commit *) XLogRecGetData(r);
subxacts = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
invals = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
+ if (xlrec->xinfo & XACT_CONTAINS_ORIGIN)
+ origin = (xl_xact_origin *) &(invals[xlrec->nmsgs]);
+
DecodeCommit(ctx, buf, XLogRecGetXid(r), xlrec->dbId,
xlrec->xact_time,
xlrec->nsubxacts, subxacts,
- xlrec->nmsgs, invals);
+ xlrec->nmsgs, invals, origin);
break;
}
@@ -220,6 +226,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_xact_commit *xlrec;
TransactionId *subxacts;
SharedInvalidationMessage *invals = NULL;
+ xl_xact_origin *origin = NULL;
/* Prepared commits contain a normal commit record... */
prec = (xl_xact_commit_prepared *) XLogRecGetData(r);
@@ -228,10 +235,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
subxacts = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
invals = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
+ if (xlrec->xinfo & XACT_CONTAINS_ORIGIN)
+ origin = (xl_xact_origin *) &(invals[xlrec->nmsgs]);
+
DecodeCommit(ctx, buf, prec->xid, xlrec->dbId,
xlrec->xact_time,
xlrec->nsubxacts, subxacts,
- xlrec->nmsgs, invals);
+ xlrec->nmsgs, invals, origin);
break;
}
@@ -244,7 +254,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeCommit(ctx, buf, XLogRecGetXid(r), InvalidOid,
xlrec->xact_time,
xlrec->nsubxacts, xlrec->subxacts,
- 0, NULL);
+ 0, NULL, NULL);
break;
}
case XLOG_XACT_ABORT:
@@ -480,10 +490,19 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
TransactionId xid, Oid dboid,
TimestampTz commit_time,
int nsubxacts, TransactionId *sub_xids,
- int ninval_msgs, SharedInvalidationMessage *msgs)
+ int ninval_msgs, SharedInvalidationMessage *msgs,
+ xl_xact_origin *origin)
{
+ RepNodeId origin_id = InvalidRepNodeId;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
int i;
+ if (origin != NULL)
+ {
+ origin_id = origin->origin_node_id;
+ origin_lsn = origin->origin_lsn;
+ }
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
@@ -504,12 +523,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* the reorderbuffer to forget the content of the (sub-)transactions
* if not.
*
- * There basically two reasons we might not be interested in this
+ * There can be several reasons we might not be interested in this
* transaction:
* 1) We might not be interested in decoding transactions up to this
* LSN. This can happen because we previously decoded it and now just
* are restarting or if we haven't assembled a consistent snapshot yet.
* 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -524,7 +544,9 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* ---
*/
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (dboid != InvalidOid && dboid != ctx->slot->data.database))
+ (dboid != InvalidOid && dboid != ctx->slot->data.database) ||
+ (ctx->callbacks.filter_by_origin_cb &&
+ ctx->callbacks.filter_by_origin_cb(ctx, origin_id)))
{
for (i = 0; i < nsubxacts; i++)
{
@@ -546,7 +568,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
/* replay actions of all transaction + subtransactions in order */
ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time);
+ commit_time, origin_id, origin_lsn);
}
/*
@@ -590,8 +612,14 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (ctx->callbacks.filter_by_origin_cb &&
+ ctx->callbacks.filter_by_origin_cb(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -632,8 +660,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (ctx->callbacks.filter_by_origin_cb &&
+ ctx->callbacks.filter_by_origin_cb(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_UPDATE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -681,8 +715,14 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (ctx->callbacks.filter_by_origin_cb &&
+ ctx->callbacks.filter_by_origin_cb(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_DELETE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
@@ -726,6 +766,11 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (rnode.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (ctx->callbacks.filter_by_origin_cb &&
+ ctx->callbacks.filter_by_origin_cb(ctx, XLogRecGetOrigin(r)))
+ return;
+
tupledata = XLogRecGetBlockData(r, 0, &tuplelen);
data = tupledata;
@@ -738,6 +783,8 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
+
memcpy(&change->data.tp.relnode, &rnode, sizeof(RelFileNode));
/*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 30baa45..638a663 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -39,6 +39,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "storage/proc.h"
@@ -46,6 +47,10 @@
#include "utils/memutils.h"
+RepNodeId replication_origin_id = InvalidRepNodeId; /* assumed identity */
+XLogRecPtr replication_origin_lsn;
+TimestampTz replication_origin_timestamp;
+
/* data for errcontext callback */
typedef struct LogicalErrorCallbackState
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bcd5896..30086c9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1255,7 +1255,8 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
void
ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time)
+ TimestampTz commit_time,
+ RepNodeId origin_id, XLogRecPtr origin_lsn)
{
ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
@@ -1273,6 +1274,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
/* serialize the last bunch of changes if we need start earlier anyway */
if (txn->nentries_mem != txn->nentries)
diff --git a/src/backend/replication/logical/replication_identifier.c b/src/backend/replication/logical/replication_identifier.c
new file mode 100644
index 0000000..1364cea
--- /dev/null
+++ b/src/backend/replication/logical/replication_identifier.c
@@ -0,0 +1,1190 @@
+/*-------------------------------------------------------------------------
+ *
+ * replication_identifier.c
+ * Logical Replication Node Identifier and replication progress persistency
+ * support.
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/replication_identifier.c
+ *
+ */
+
+#include "postgres.h"
+
+#include <unistd.h>
+#include <sys/stat.h>
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+
+#include "catalog/indexing.h"
+
+#include "nodes/execnodes.h"
+
+#include "replication/replication_identifier.h"
+#include "replication/logical.h"
+
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/copydir.h"
+
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/pg_lsn.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+/*
+ * Replay progress of a single remote node.
+ */
+typedef struct ReplicationState
+{
+ /*
+ * Local identifier for the remote node.
+ */
+ RepNodeId local_identifier;
+
+ /*
+ * Location of the latest commit from the remote side.
+ */
+ XLogRecPtr remote_lsn;
+
+ /*
+ * Remember the local lsn of the commit record so we can XLogFlush() to it
+ * during a checkpoint so we know the commit record actually is safe on
+ * disk.
+ */
+ XLogRecPtr local_lsn;
+
+ /*
+ * Slot is setup in backend?
+ */
+ pid_t acquired_by;
+} ReplicationState;
+
+/*
+ * On disk version of ReplicationState.
+ */
+typedef struct ReplicationStateOnDisk
+{
+ RepNodeId local_identifier;
+ XLogRecPtr remote_lsn;
+} ReplicationStateOnDisk;
+
+
+/*
+ * Base address into a shared memory array of replication states of size
+ * max_replication_slots.
+ *
+ * XXX: Should we use a separate variable to size this rather than
+ * max_replication_slots?
+ */
+static ReplicationState *ReplicationStates;
+
+/*
+ * Backend-local, cached element from ReplicationStates for use in a backend
+ * replaying remote commits, so we don't have to search ReplicationStates for
+ * the backends current RepNodeId.
+ */
+static ReplicationState *local_replication_state = NULL;
+
+/* Magic for on disk files. */
+#define REPLICATION_STATE_MAGIC (uint32)0x1257DADE
+
+/* XXX: move to c.h? */
+#ifndef UINT16_MAX
+#define UINT16_MAX ((1<<16) - 1)
+#else
+#if UINT16_MAX != ((1<<16) - 1)
+#error "uh, wrong UINT16_MAX?"
+#endif
+#endif
+
+/*
+ * Check for a persistent repication identifier identified by the replication
+ * identifier's external name..
+ *
+ * Returns InvalidOid if the node isn't known yet.
+ */
+RepNodeId
+GetReplicationIdentifier(char *riname, bool missing_ok)
+{
+ Form_pg_replication_identifier ident;
+ Oid riident = InvalidOid;
+ HeapTuple tuple;
+ Datum riname_d;
+
+ riname_d = CStringGetTextDatum(riname);
+
+ tuple = SearchSysCache1(REPLIDREMOTE, riname_d);
+ if (HeapTupleIsValid(tuple))
+ {
+ ident = (Form_pg_replication_identifier) GETSTRUCT(tuple);
+ riident = ident->riident;
+ ReleaseSysCache(tuple);
+ }
+ else if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication identifier named %s",
+ riname);
+
+ return riident;
+}
+
+/*
+ * Create a persistent replication identifier.
+ *
+ * Needs to be called in a transaction.
+ */
+RepNodeId
+CreateReplicationIdentifier(char *riname)
+{
+ Oid riident;
+ HeapTuple tuple = NULL;
+ Relation rel;
+ Datum riname_d;
+ SnapshotData SnapshotDirty;
+ SysScanDesc scan;
+ ScanKeyData key;
+
+ riname_d = CStringGetTextDatum(riname);
+
+ Assert(IsTransactionState());
+
+ /*
+ * We need the numeric replication identifiers to be 16bit wide, so we
+ * cannot rely on the normal oid allocation. So we simply scan
+ * pg_replication_identifier for the first unused id. That's not
+ * particularly efficient, but this should be an fairly infrequent
+ * operation - we can easily spend a bit more code on this when it turns
+ * out it needs to be faster.
+ *
+ * We handle concurrency by taking an exclusive lock (allowing reads!)
+ * over the table for the duration of the search. Because we use a "dirty
+ * snapshot" we can read rows that other in-progress sessions have
+ * written, even though they would be invisible with normal snapshots. Due
+ * to the exclusive lock there's no danger that new rows can appear while
+ * we're checking.
+ */
+ InitDirtySnapshot(SnapshotDirty);
+
+ rel = heap_open(ReplicationIdentifierRelationId, ExclusiveLock);
+
+ for (riident = InvalidOid + 1; riident < UINT16_MAX; riident++)
+ {
+ bool nulls[Natts_pg_replication_identifier];
+ Datum values[Natts_pg_replication_identifier];
+ bool collides;
+ CHECK_FOR_INTERRUPTS();
+
+ ScanKeyInit(&key,
+ Anum_pg_replication_riident,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(riident));
+
+ scan = systable_beginscan(rel, ReplicationLocalIdentIndex,
+ true /* indexOK */,
+ &SnapshotDirty,
+ 1, &key);
+
+ collides = HeapTupleIsValid(systable_getnext(scan));
+
+ systable_endscan(scan);
+
+ if (!collides)
+ {
+ /*
+ * Ok, found an unused riident, insert the new row and do a CCI,
+ * so our callers can look it up if they want to.
+ */
+ memset(&nulls, 0, sizeof(nulls));
+
+ values[Anum_pg_replication_riident -1] = ObjectIdGetDatum(riident);
+ values[Anum_pg_replication_riname - 1] = riname_d;
+
+ tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
+ simple_heap_insert(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple);
+ CommandCounterIncrement();
+ break;
+ }
+ }
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+
+ if (tuple == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("no free replication id could be found")));
+
+ heap_freetuple(tuple);
+ return riident;
+}
+
+
+/*
+ * Create a persistent replication identifier.
+ *
+ * Needs to be called in a transaction.
+ */
+void
+DropReplicationIdentifier(RepNodeId riident)
+{
+ HeapTuple tuple = NULL;
+ Relation rel;
+ SnapshotData SnapshotDirty;
+ SysScanDesc scan;
+ ScanKeyData key;
+ int i;
+
+ Assert(IsTransactionState());
+
+ InitDirtySnapshot(SnapshotDirty);
+
+ rel = heap_open(ReplicationIdentifierRelationId, ExclusiveLock);
+
+ /* cleanup the slot state info */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state = &ReplicationStates[i];
+
+ /* found our slot */
+ if (state->local_identifier == riident)
+ {
+ if (state->acquired_by != 0)
+ {
+ elog(ERROR, "cannot drop slot that is setup in backend %d",
+ state->acquired_by);
+ }
+ memset(state, 0, sizeof(ReplicationState));
+ break;
+ }
+ }
+
+ ScanKeyInit(&key,
+ Anum_pg_replication_riident,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(riident));
+
+ scan = systable_beginscan(rel, ReplicationLocalIdentIndex,
+ true /* indexOK */,
+ &SnapshotDirty,
+ 1, &key);
+
+ tuple = systable_getnext(scan);
+
+ if (HeapTupleIsValid(tuple))
+ simple_heap_delete(rel, &tuple->t_self);
+
+ systable_endscan(scan);
+
+ CommandCounterIncrement();
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+}
+
+
+/*
+ * Lookup pg_replication_identifier tuple via its riident.
+ *
+ * The result needs to be ReleaseSysCache'ed and is an invalid HeapTuple if
+ * the lookup failed.
+ */
+void
+GetReplicationInfoByIdentifier(RepNodeId riident, bool missing_ok, char **riname)
+{
+ HeapTuple tuple;
+ Form_pg_replication_identifier ric;
+
+ Assert(OidIsValid((Oid) riident));
+ Assert(riident != InvalidRepNodeId);
+ Assert(riident != DoNotReplicateRepNodeId);
+
+ tuple = SearchSysCache1(REPLIDIDENT,
+ ObjectIdGetDatum((Oid) riident));
+
+ if (HeapTupleIsValid(tuple))
+ {
+ ric = (Form_pg_replication_identifier) GETSTRUCT(tuple);
+ *riname = pstrdup(text_to_cstring(&ric->riname));
+ }
+
+ if (!HeapTupleIsValid(tuple) && !missing_ok)
+ elog(ERROR, "cache lookup failed for replication identifier id: %u",
+ riident);
+
+ if (HeapTupleIsValid(tuple))
+ ReleaseSysCache(tuple);
+}
+
+static void
+CheckReplicationIdentifierPrerequisites(bool check_slots)
+{
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("only superusers can query or manipulate replication identifiers")));
+
+ if (check_slots && max_replication_slots == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot query or manipulate replication identifiers when max_replication_slots = 0")));
+
+}
+
+Datum
+pg_replication_identifier_get(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ riident = GetReplicationIdentifier(name, true);
+
+ pfree(name);
+
+ if (OidIsValid(riident))
+ PG_RETURN_OID(riident);
+ PG_RETURN_NULL();
+}
+
+
+Datum
+pg_replication_identifier_create(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ riident = CreateReplicationIdentifier(name);
+
+ pfree(name);
+
+ PG_RETURN_OID(riident);
+}
+
+Datum
+pg_replication_identifier_setup_replaying_from(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId origin;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ origin = GetReplicationIdentifier(name, false);
+ SetupCachedReplicationIdentifier(origin);
+
+ replication_origin_id = origin;
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_is_replaying(PG_FUNCTION_ARGS)
+{
+ CheckReplicationIdentifierPrerequisites(true);
+
+ PG_RETURN_BOOL(replication_origin_id != InvalidRepNodeId);
+}
+
+Datum
+pg_replication_identifier_reset_replaying_from(PG_FUNCTION_ARGS)
+{
+ CheckReplicationIdentifierPrerequisites(true);
+
+ TeardownCachedReplicationIdentifier();
+
+ replication_origin_id = InvalidRepNodeId;
+
+ PG_RETURN_VOID();
+}
+
+
+Datum
+pg_replication_identifier_setup_tx_origin(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr location = PG_GETARG_LSN(0);
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ if (local_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("need to setup the origin id first")));
+
+ replication_origin_lsn = location;
+ replication_origin_timestamp = PG_GETARG_TIMESTAMPTZ(1);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_get_replication_identifier_progress(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ int i;
+#define REPLICATION_IDENTIFIER_PROGRESS_COLS 4
+
+ /* we we want to return 0 rows if slot is set to zero */
+ CheckReplicationIdentifierPrerequisites(false);
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (tupdesc->natts != REPLICATION_IDENTIFIER_PROGRESS_COLS)
+ elog(ERROR, "wrong function definition");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ /* prevent slots from being concurrently dropped */
+ LockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ /*
+ * Iterate through all possible ReplicationStates, display if they are
+ * filled. Note that we do not take any locks, so slightly corrupted/out
+ * of date values are a possibility.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+ Datum values[REPLICATION_IDENTIFIER_PROGRESS_COLS];
+ bool nulls[REPLICATION_IDENTIFIER_PROGRESS_COLS];
+ char *riname;
+
+ state = &ReplicationStates[i];
+
+ /* unused slot, nothing to display */
+ if (state->local_identifier == InvalidRepNodeId)
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[ 0] = ObjectIdGetDatum(state->local_identifier);
+
+ GetReplicationInfoByIdentifier(state->local_identifier, true, &riname);
+
+ /*
+ * We're not preventing the identifier to be dropped concurrently, so
+ * silently accept that it might be gone.
+ */
+ if (!riname)
+ continue;
+
+ values[ 1] = CStringGetTextDatum(riname);
+
+ values[ 2] = LSNGetDatum(state->remote_lsn);
+
+ values[ 3] = LSNGetDatum(state->local_lsn);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ tuplestore_donestoring(tupstore);
+
+ UnlockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+#undef REPLICATION_IDENTIFIER_PROGRESS_COLS
+
+ return (Datum) 0;
+}
+
+Datum
+pg_replication_identifier_advance(PG_FUNCTION_ARGS)
+{
+ text *name = PG_GETARG_TEXT_P(0);
+ XLogRecPtr remote_commit = PG_GETARG_LSN(1);
+ XLogRecPtr local_commit = PG_GETARG_LSN(2);
+ RepNodeId node;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ /* lock to prevent the replication identifier from vanishing */
+ LockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ node = GetReplicationIdentifier(text_to_cstring(name), false);
+
+ AdvanceReplicationIdentifier(node, remote_commit, local_commit);
+
+ UnlockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_drop(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+
+ riident = GetReplicationIdentifier(name, false);
+ Assert(OidIsValid(riident));
+
+ DropReplicationIdentifier(riident);
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+Size
+ReplicationIdentifierShmemSize(void)
+{
+ Size size = 0;
+
+ /*
+ * FIXME: max_replication_slots is the wrong thing to use here, here we keep
+ * the replay state of *remote* transactions.
+ */
+ if (max_replication_slots == 0)
+ return size;
+
+ size = add_size(size,
+ mul_size(max_replication_slots, sizeof(ReplicationState)));
+ return size;
+}
+
+void
+ReplicationIdentifierShmemInit(void)
+{
+ bool found;
+
+ if (max_replication_slots == 0)
+ return;
+
+ ReplicationStates = (ReplicationState *)
+ ShmemInitStruct("ReplicationIdentifierState",
+ ReplicationIdentifierShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ MemSet(ReplicationStates, 0, ReplicationIdentifierShmemSize());
+ }
+}
+
+/* ---------------------------------------------------------------------------
+ * Perform a checkpoint of replication identifier's progress with respect to
+ * the replayed remote_lsn. Make sure that all transactions we refer to in the
+ * checkpoint (local_lsn) are actually on-disk. This might not yet be the case
+ * if the transactions were originally committed asynchronously.
+ *
+ * We store checkpoints in the following format:
+ * +-------+------------------------+------------------+-----+--------+
+ * | MAGIC | ReplicationStateOnDisk | struct Replic... | ... | CRC32C | EOF
+ * +-------+------------------------+------------------+-----+--------+
+ *
+ * So its just the magic, followed by the statically sized
+ * ReplicationStateOnDisk structs. Note that the maximum number of
+ * ReplicationStates is determined by max_replication_slots.
+ *
+ * FIXME: Add a CRC32 to the end.
+ * ---------------------------------------------------------------------------
+ */
+void
+CheckPointReplicationIdentifier(XLogRecPtr ckpt)
+{
+ char tmppath[MAXPGPATH];
+ char path[MAXPGPATH];
+ int fd;
+ int tmpfd;
+ int i;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ pg_crc32 crc;
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ /*
+ * Write to a filename a LSN of the checkpoint's REDO pointer, so we can
+ * deal with the checkpoint failing after
+ * CheckPointReplicationIdentifier() finishing.
+ */
+ sprintf(path, "pg_logical/checkpoints/%X-%X.ckpt",
+ (uint32)(ckpt >> 32), (uint32)ckpt);
+ sprintf(tmppath, "pg_logical/checkpoints/%X-%X.ckpt.tmp",
+ (uint32)(ckpt >> 32), (uint32)ckpt);
+
+ /* check whether file already exists */
+ fd = OpenTransientFile(path,
+ O_RDONLY | PG_BINARY,
+ 0);
+
+ /* usual case, no checkpoint performed yet */
+ if (fd < 0 && errno == ENOENT)
+ ;
+ else if (fd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not check replication state checkpoint \"%s\": %m",
+ path)));
+ /* already checkpointed before crash during a checkpoint or so */
+ else
+ {
+ CloseTransientFile(fd);
+ return;
+ }
+
+ /* make sure no old temp file is remaining */
+ if (unlink(tmppath) < 0 && errno != ENOENT)
+ ereport(PANIC, (errmsg("failed while unlinking %s", path)));
+
+ /*
+ * no other backend can perform this at the same time, we're protected by
+ * CheckpointLock.
+ */
+ tmpfd = OpenTransientFile(tmppath,
+ O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
+ S_IRUSR | S_IWUSR);
+ if (tmpfd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not create replication identifier checkpoint \"%s\": %m",
+ tmppath)));
+
+ /* write magic */
+ if ((write(tmpfd, &magic, sizeof(magic))) !=
+ sizeof(magic))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write replication identifier checkpoint \"%s\": %m",
+ tmppath)));
+ }
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ /* write actual data */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationStateOnDisk disk_state;
+
+ /* XXX: Locking */
+
+ if (ReplicationStates[i].local_identifier == InvalidRepNodeId)
+ continue;
+
+ disk_state.local_identifier = ReplicationStates[i].local_identifier;
+ disk_state.remote_lsn = ReplicationStates[i].remote_lsn;
+
+ /* make sure we only write out a commit that's persistent */
+ XLogFlush(ReplicationStates[i].local_lsn);
+
+ if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
+ sizeof(disk_state))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write replication identifier checkpoint \"%s\": %m",
+ tmppath)));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+ }
+
+ /* write out the CRC */
+ FIN_CRC32C(crc);
+ if ((write(tmpfd, &crc, sizeof(crc))) !=
+ sizeof(crc))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write replication identifier checkpoint \"%s\": %m",
+ tmppath)));
+ }
+
+ /* fsync the file */
+ if (pg_fsync(tmpfd) != 0)
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync replication identifier checkpoint \"%s\": %m",
+ tmppath)));
+ }
+
+ CloseTransientFile(tmpfd);
+
+ /* rename to permanent file, fsync file and directory */
+ if (rename(tmppath, path) != 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not rename replication identifier checkpoint from \"%s\" to \"%s\": %m",
+ tmppath, path)));
+ }
+
+ fsync_fname("pg_logical/checkpoints", true);
+ fsync_fname(path, false);
+}
+
+/*
+ * Remove old replication identifier checkpoints that cannot possibly be
+ * needed anymore for crash recovery.
+ */
+void
+TruncateReplicationIdentifier(XLogRecPtr cutoff)
+{
+ DIR *snap_dir;
+ struct dirent *snap_de;
+ char path[MAXPGPATH];
+
+ snap_dir = AllocateDir("pg_logical/checkpoints");
+ while ((snap_de = ReadDir(snap_dir, "pg_logical/checkpoints")) != NULL)
+ {
+ uint32 hi;
+ uint32 lo;
+ XLogRecPtr lsn;
+ struct stat statbuf;
+
+ if (strcmp(snap_de->d_name, ".") == 0 ||
+ strcmp(snap_de->d_name, "..") == 0)
+ continue;
+
+ snprintf(path, MAXPGPATH, "pg_logical/checkpoints/%s", snap_de->d_name);
+
+ if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+ {
+ elog(DEBUG1, "only regular files expected: %s", path);
+ continue;
+ }
+
+ if (sscanf(snap_de->d_name, "%X-%X.ckpt", &hi, &lo) != 2)
+ {
+ ereport(LOG,
+ (errmsg("could not parse filename \"%s\"", path)));
+ continue;
+ }
+
+ lsn = ((uint64) hi) << 32 | lo;
+
+ /* check whether we still need it */
+ if (lsn < cutoff)
+ {
+ elog(DEBUG2, "removing replication identifier checkpoint %s", path);
+
+ /*
+ * It's not particularly harmful, though strange, if we can't
+ * remove the file here. Don't prevent the checkpoint from
+ * completing, that'd be cure worse than the disease.
+ */
+ if (unlink(path) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not unlink file \"%s\": %m",
+ path)));
+ continue;
+ }
+ }
+ else
+ {
+ elog(DEBUG2, "keeping replication identifier checkpoint %s", path);
+ }
+ }
+ FreeDir(snap_dir);
+}
+
+/*
+ * Recover replication replay status from checkpoint data saved earlier by
+ * CheckPointReplicationIdentifier.
+ *
+ * This only needs to be called at startup and *not* during every checkpoint
+ * read during recovery (e.g. in HS or PITR from a base backup) afterwards. All
+ * state thereafter can be recovered by looking at commit records.
+ */
+void
+StartupReplicationIdentifier(XLogRecPtr ckpt)
+{
+ char path[MAXPGPATH];
+ int fd;
+ int readBytes;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ int last_state = 0;
+ pg_crc32 file_crc;
+ pg_crc32 crc;
+
+ /* don't want to overwrite already existing state */
+#ifdef USE_ASSERT_CHECKING
+ static bool already_started = false;
+ Assert(!already_started);
+ already_started = true;
+#endif
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ elog(LOG, "starting up replication identifier with ckpt at %X/%X",
+ (uint32)(ckpt >> 32), (uint32)ckpt);
+
+ sprintf(path, "pg_logical/checkpoints/%X-%X.ckpt",
+ (uint32)(ckpt >> 32), (uint32)ckpt);
+
+ fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+
+ /*
+ * might have had max_replication_slots == 0 last run, or we just brought up a
+ * standby.
+ */
+ if (fd < 0 && errno == ENOENT)
+ return;
+ else if (fd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not open replication state checkpoint \"%s\": %m",
+ path)));
+
+ /* verify magic, thats written even if nothing was active */
+ readBytes = read(fd, &magic, sizeof(magic));
+ if (readBytes != sizeof(magic))
+ ereport(PANIC,
+ (errmsg("could not read replication state checkpoint magic \"%s\": %m",
+ path)));
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ if (magic != REPLICATION_STATE_MAGIC)
+ ereport(PANIC,
+ (errmsg("replication checkpoint has wrong magic %u instead of %u",
+ magic, REPLICATION_STATE_MAGIC)));
+
+ /* recover individual states, until there are no more to be found */
+ while (true)
+ {
+ ReplicationStateOnDisk disk_state;
+
+ readBytes = read(fd, &disk_state, sizeof(disk_state));
+
+ /* no further data */
+ if (readBytes == sizeof(crc))
+ {
+ /* not pretty, but simple ... */
+ file_crc = *(pg_crc32*) &disk_state;
+ break;
+ }
+
+ if (readBytes < 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read replication checkpoint file \"%s\": %m",
+ path)));
+ }
+
+ if (readBytes != sizeof(disk_state))
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read replication checkpoint file \"%s\": read %d of %zu",
+ path, readBytes, sizeof(disk_state))));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+
+ if (last_state == max_replication_slots)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state could be found, increase max_replication_slots")));
+
+ /* copy data to shared memory */
+ ReplicationStates[last_state].local_identifier = disk_state.local_identifier;
+ ReplicationStates[last_state].remote_lsn = disk_state.remote_lsn;
+ last_state++;
+
+ elog(LOG, "recovered replication state of node %u to %X/%X",
+ disk_state.local_identifier,
+ (uint32)(disk_state.remote_lsn >> 32),
+ (uint32)disk_state.remote_lsn);
+ }
+
+ /* now check checksum */
+ FIN_CRC32C(crc);
+ if (file_crc != crc)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("replication_slot_checkpoint has wrong checksum %u, expected %u",
+ crc, file_crc)));
+
+ CloseTransientFile(fd);
+}
+
+/*
+ * Tell the replication identifier machinery that a commit from 'node' that
+ * originated at the LSN remote_commit on the remote node was replayed
+ * successfully and that we don't need to do so again. In combination with
+ * setting up replication_origin_lsn and replication_origin_id that ensures we
+ * won't loose knowledge about that after a crash if the the transaction had a
+ * persistent effect (think of asynchronous commits).
+ *
+ * local_commit needs to be a local LSN of the commit so that we can make sure
+ * uppon a checkpoint that enough WAL has been persisted to disk.
+ *
+ * Needs to be called with a RowExclusiveLock on pg_replication_identifier,
+ * unless running in recovery.
+ */
+void
+AdvanceReplicationIdentifier(RepNodeId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit)
+{
+ int i;
+ int free_slot = -1;
+ ReplicationState *replication_state = NULL;
+
+ Assert(node != InvalidRepNodeId);
+
+ /* we don't track DoNotReplicateRepNodeId */
+ if (node == DoNotReplicateRepNodeId)
+ return;
+
+ /*
+ * XXX: should we restore into a hashtable and dump into shmem only after
+ * recovery finished?
+ */
+
+ /* check whether slot already exists */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &ReplicationStates[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->local_identifier == InvalidRepNodeId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->local_identifier != node)
+ continue;
+
+ if (curstate->acquired_by != 0)
+ {
+ elog(ERROR, "cannot advance slot that is setup in backend %d",
+ curstate->acquired_by);
+ }
+
+ /* ok, found slot */
+ replication_state = curstate;
+ break;
+ }
+
+ if (replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state could be found for %u, increase max_replication_slots",
+ node)));
+ /* initialize new slot */
+ else if (replication_state == NULL)
+ {
+ replication_state = &ReplicationStates[free_slot];
+ Assert(replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(replication_state->local_lsn == InvalidXLogRecPtr);
+ replication_state->local_identifier = node;
+ }
+
+ Assert(replication_state->local_identifier != InvalidRepNodeId);
+
+ /*
+ * Due to - harmless - race conditions during a checkpoint we could see
+ * values here that are older than the ones we already have in
+ * memory. Don't overwrite those.
+ */
+ if (replication_state->remote_lsn < remote_commit)
+ replication_state->remote_lsn = remote_commit;
+ if (replication_state->local_lsn < local_commit)
+ replication_state->local_lsn = local_commit;
+}
+
+/*
+ * Tear down a (possibly) cached replication identifier during process exit.
+ */
+static void
+ReplicationIdentifierExitCleanup(int code, Datum arg)
+{
+ if (local_replication_state != NULL &&
+ local_replication_state->acquired_by == MyProcPid)
+ {
+ local_replication_state->acquired_by = 0;
+ local_replication_state = NULL;
+ }
+}
+
+/*
+ * Setup a replication identifier in the shared memory struct if it doesn't
+ * already exists and cache access to the specific ReplicationSlot so the
+ * array doesn't have to be searched when calling
+ * AdvanceCachedReplicationIdentifier().
+ *
+ * Obviously only one such cached identifier can exist per process and the
+ * current cached value can only be set again after the previous value is torn
+ * down with TeardownCachedReplicationIdentifier().
+ */
+void
+SetupCachedReplicationIdentifier(RepNodeId node)
+{
+ static bool registered_cleanup;
+ int i;
+ int free_slot = -1;
+
+ if (!registered_cleanup)
+ {
+ on_shmem_exit(ReplicationIdentifierExitCleanup, 0);
+ registered_cleanup = true;
+ }
+
+ Assert(max_replication_slots > 0);
+
+ if (local_replication_state != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot setup replication origin when one is already setup")));
+
+ LockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ /*
+ * Search for either an existing slot for that identifier or a free one we
+ * can use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &ReplicationStates[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->local_identifier == InvalidRepNodeId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->local_identifier != node)
+ continue;
+
+ else if (curstate->acquired_by != 0)
+ {
+ elog(ERROR, "cannot setup slot that is already setup in backend %d",
+ curstate->acquired_by);
+ }
+
+ local_replication_state = curstate;
+ }
+
+
+ if (local_replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state could be found for %u, increase max_replication_slots",
+ node)));
+ else if (local_replication_state == NULL)
+ {
+ local_replication_state = &ReplicationStates[free_slot];
+ Assert(local_replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(local_replication_state->local_lsn == InvalidXLogRecPtr);
+ local_replication_state->local_identifier = node;
+ }
+
+ Assert(local_replication_state->local_identifier != InvalidRepNodeId);
+
+ local_replication_state->acquired_by = MyProcPid;
+
+ UnlockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+}
+
+/*
+ * Make currently cached replication identifier unavailable so a new one can
+ * be setup with SetupCachedReplicationIdentifier().
+ *
+ * This function may only be called if a previous identifier was setup with
+ * SetupCachedReplicationIdentifier().
+ */
+void
+TeardownCachedReplicationIdentifier(void)
+{
+ Assert(max_replication_slots != 0);
+
+ if (local_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("no replication identifier is set up")));
+
+ local_replication_state->acquired_by = 0;
+ local_replication_state = NULL;
+}
+
+/*
+ * Do the same work AdvanceReplicationIdentifier() does, just on a pre-cached
+ * identifier. This is noticeably cheaper if you only ever work on a single
+ * replication identifier.
+ */
+void
+AdvanceCachedReplicationIdentifier(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit)
+{
+ Assert(local_replication_state != NULL);
+ Assert(local_replication_state->local_identifier != InvalidRepNodeId);
+
+ if (local_replication_state->local_lsn < local_commit)
+ local_replication_state->local_lsn = local_commit;
+ if (local_replication_state->remote_lsn < remote_commit)
+ local_replication_state->remote_lsn = remote_commit;
+}
+
+/*
+ * Ask the machinery about the point up to which we successfully replayed
+ * changes from a already setup & cached replication identifier.
+ */
+XLogRecPtr
+RemoteCommitFromCachedReplicationIdentifier(void)
+{
+ Assert(local_replication_state != NULL);
+ return local_replication_state->remote_lsn;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 16b9808..e927698 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "replication/slot.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/replication_identifier.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
@@ -132,6 +133,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, CheckpointerShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
+ size = add_size(size, ReplicationIdentifierShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -238,6 +240,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
CheckpointerShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
+ ReplicationIdentifierShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index bd27168..fdccb95 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -54,6 +54,7 @@
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
+#include "catalog/pg_replication_identifier.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_tablespace.h"
#include "catalog/pg_ts_config.h"
@@ -620,6 +621,28 @@ static const struct cachedesc cacheinfo[] = {
},
128
},
+ {ReplicationIdentifierRelationId, /* REPLIDIDENT */
+ ReplicationLocalIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_riident,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
+ {ReplicationIdentifierRelationId, /* REPLIDREMOTE */
+ ReplicationExternalIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_riname,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
{RewriteRelationId, /* RULERELNAME */
RewriteRelRulenameIndexId,
2,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9572777..fd2d32f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -58,6 +58,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/logical.h"
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "replication/walreceiver.h"
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 18614e7..c2a5e15 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -202,6 +202,7 @@ static const char *subdirs[] = {
"pg_stat",
"pg_stat_tmp",
"pg_logical",
+ "pg_logical/checkpoints",
"pg_logical/snapshots",
"pg_logical/mappings"
};
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index a16089f..3ae42b8 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -55,6 +55,8 @@
#include "common/fe_memutils.h"
#include "storage/large_object.h"
#include "pg_getopt.h"
+#include "replication/logical.h"
+#include "replication/replication_identifier.h"
static ControlFileData ControlFile; /* pg_control values */
@@ -1088,6 +1090,7 @@ WriteEmptyXLOG(void)
record->xl_tot_len = SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint);
record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
record->xl_rmid = RM_XLOG_ID;
+ record->xl_origin_id = InvalidRepNodeId;
recptr += SizeOfXLogRecord;
*(recptr++) = XLR_BLOCK_ID_DATA_SHORT;
*(recptr++) = sizeof(CheckPoint);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8205504..8bc047b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,10 +146,18 @@ typedef struct xl_xact_commit
RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */
/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
/* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
+ /* xl_xact_origin follows if xinfo specifies it */
} xl_xact_commit;
#define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)
+typedef struct xl_xact_origin
+{
+ XLogRecPtr origin_lsn;
+ RepNodeId origin_node_id;
+ TimestampTz origin_timestamp;
+} xl_xact_origin;
+
/*
* These flags are set in the xinfo fields of WAL commit records,
* indicating a variety of additional actions that need to occur
@@ -160,7 +168,7 @@ typedef struct xl_xact_commit
*/
#define XACT_COMPLETION_UPDATE_RELCACHE_FILE 0x01
#define XACT_COMPLETION_FORCE_SYNC_COMMIT 0x02
-
+#define XACT_CONTAINS_ORIGIN 0x04
/* Access macros for above flags */
#define XactCompletionRelcacheInitFileInval(xinfo) (xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE)
#define XactCompletionForceSyncCommit(xinfo) (xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 138deaf..f06d11f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -85,6 +85,7 @@ typedef enum
} RecoveryTargetType;
extern XLogRecPtr XactLastRecEnd;
+extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
extern bool reachedConsistency;
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 6638c1d..bd8dd70 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -45,6 +45,12 @@ typedef uint64 XLogSegNo;
typedef uint32 TimeLineID;
/*
+ * Denotes the node on which the action causing a wal record to be logged
+ * originated on.
+ */
+typedef uint16 RepNodeId;
+
+/*
* Because O_DIRECT bypasses the kernel buffers, and because we never
* read those buffers except during crash recovery or if wal_level != minimal,
* it is a win to use it in all cases where we sync on each write(). We could
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 74bec20..ef05879 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -125,6 +125,10 @@ struct XLogReaderState
uint32 main_data_len; /* main data portion's length */
uint32 main_data_bufsz; /* allocated size of the buffer */
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ RepNodeId record_origin;
+#endif
+
/* information about blocks referenced by the record. */
DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
@@ -184,6 +188,11 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define XLogRecGetOrigin(decoder) ((decoder)->decoded_record->xl_origin_id)
+#else
+#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#endif
#define XLogRecGetData(decoder) ((decoder)->main_data)
#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 25a9265..048e45f 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -45,7 +45,7 @@ typedef struct XLogRecord
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
- /* 2 bytes of padding here, initialize to zero */
+ RepNodeId xl_origin_id; /* what node did originally cause this record to be written */
pg_crc32 xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
@@ -174,5 +174,8 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define XLR_BLOCK_ID_ORIGIN 253
+#endif
#endif /* XLOGRECORD_H */
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a680229..405528d 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -305,6 +305,12 @@ DECLARE_UNIQUE_INDEX(pg_policy_oid_index, 3257, on pg_policy using btree(oid oid
DECLARE_UNIQUE_INDEX(pg_policy_polrelid_polname_index, 3258, on pg_policy using btree(polrelid oid_ops, polname name_ops));
#define PolicyPolrelidPolnameIndexId 3258
+DECLARE_UNIQUE_INDEX(pg_replication_identifier_riiident_index, 6001, on pg_replication_identifier using btree(riident oid_ops));
+#define ReplicationLocalIdentIndex 6001
+
+DECLARE_UNIQUE_INDEX(pg_replication_identifier_riname_index, 6002, on pg_replication_identifier using btree(riname varchar_pattern_ops));
+#define ReplicationExternalIdentIndex 6002
+
/* last step of initialization script: build the indexes declared above */
BUILD_INDICES
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 9edfdb8..3765b38 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5143,6 +5143,34 @@ DESCR("rank of hypothetical row without gaps");
DATA(insert OID = 3993 ( dense_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
DESCR("aggregate final function");
+/* replication_identifier.h */
+DATA(insert OID = 6003 ( pg_replication_identifier_create PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_create _null_ _null_ _null_ ));
+DESCR("create local replication identifier for the passed external one");
+
+DATA(insert OID = 6004 ( pg_replication_identifier_get PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_get _null_ _null_ _null_ ));
+DESCR("translate the external node identifier to a local one");
+
+DATA(insert OID = 6005 ( pg_replication_identifier_setup_replaying_from PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_setup_replaying_from _null_ _null_ _null_ ));
+DESCR("setup from which node we are replaying transactions from currently");
+
+DATA(insert OID = 6006 ( pg_replication_identifier_reset_replaying_from PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_replication_identifier_reset_replaying_from _null_ _null_ _null_ ));
+DESCR("reset replay mode");
+
+DATA(insert OID = 6007 ( pg_replication_identifier_setup_tx_origin PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ pg_replication_identifier_setup_tx_origin _null_ _null_ _null_ ));
+DESCR("setup transaction timestamp and origin lsn");
+
+DATA(insert OID = 6008 ( pg_get_replication_identifier_progress PGNSP PGUID 12 1 100 0 0 f f f f f t v 0 0 2249 "" "{26,25,3220,3220}" "{o,o,o,o}" "{local_id, external_id, remote_lsn, local_lsn}" _null_ pg_get_replication_identifier_progress _null_ _null_ _null_ ));
+DESCR("replication identifier progress");
+
+DATA(insert OID = 6009 ( pg_replication_identifier_is_replaying PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_replication_identifier_is_replaying _null_ _null_ _null_ ));
+DESCR("is a replication identifier setup");
+
+DATA(insert OID = 6010 ( pg_replication_identifier_advance PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2278 "25 3220 3220" _null_ _null_ _null_ _null_ pg_replication_identifier_advance _null_ _null_ _null_ ));
+DESCR("advance replication itentifier to specific location");
+
+DATA(insert OID = 6011 ( pg_replication_identifier_drop PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_drop _null_ _null_ _null_ ));
+DESCR("drop existing replication identifier");
+
/*
* Symbolic values for provolatile column: these indicate whether the result
diff --git a/src/include/catalog/pg_replication_identifier.h b/src/include/catalog/pg_replication_identifier.h
new file mode 100644
index 0000000..26eec17
--- /dev/null
+++ b/src/include/catalog/pg_replication_identifier.h
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_replication_identifier.h
+ * Persistent Replication Node Identifiers
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/pg_replication_identifier.h
+ *
+ * NOTES
+ * the genbki.pl script reads this file and generates .bki
+ * information from the DATA() statements.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_REPLICATION_IDENTIFIER_H
+#define PG_REPLICATION_IDENTIFIER_H
+
+#include "catalog/genbki.h"
+#include "access/xlogdefs.h"
+
+/* ----------------
+ * pg_replication_identifier. cpp turns this into
+ * typedef struct FormData_pg_replication_identifier
+ * ----------------
+ */
+#define ReplicationIdentifierRelationId 6000
+
+CATALOG(pg_replication_identifier,6000) BKI_SHARED_RELATION BKI_WITHOUT_OIDS
+{
+ /*
+ * locally known identifier that gets included into wal.
+ *
+ * This should never leave the system.
+ *
+ * Needs to fit into a uint16, so we don't waste too much space in WAL
+ * records. For this reason we don't use a normal Oid column here, since
+ * we need to handle allocation of new values manually.
+ */
+ Oid riident;
+
+ /*
+ * Variable-length fields start here, but we allow direct access to
+ * riname.
+ */
+
+ /* external, free-format, identifier */
+ text riname;
+#ifdef CATALOG_VARLEN /* further variable-length fields */
+#endif
+} FormData_pg_replication_identifier;
+
+/* ----------------
+ * Form_pg_extension corresponds to a pointer to a tuple with
+ * the format of pg_extension relation.
+ * ----------------
+ */
+typedef FormData_pg_replication_identifier *Form_pg_replication_identifier;
+
+/* ----------------
+ * compiler constants for pg_replication_identifier
+ * ----------------
+ */
+
+#define Natts_pg_replication_identifier 2
+#define Anum_pg_replication_riident 1
+#define Anum_pg_replication_riname 2
+
+/* ----------------
+ * pg_replication_identifier has no initial contents
+ * ----------------
+ */
+
+#endif /* PG_REPLICTION_IDENTIFIER_H */
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 5cfc0ae..c787523 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -265,6 +265,12 @@
#endif
/*
+ * Temporary switch to change between using xlog padding or a separate block
+ * id in the record to record the xlog origin of a record.
+ */
+/* #define REPLICATION_IDENTIFIER_REUSE_PADDING */
+
+/*
* Define this to cause palloc()'d memory to be filled with random data, to
* facilitate catching code that depends on the contents of uninitialized
* memory. Caution: this is horrendously expensive.
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 0935c1b..26095b1 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -74,6 +74,13 @@ typedef void (*LogicalDecodeCommitCB) (
XLogRecPtr commit_lsn);
/*
+ * Filter changes by origin.
+ */
+typedef bool (*LogicalDecodeFilterByOriginCB) (
+ struct LogicalDecodingContext *,
+ RepNodeId origin_id);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (
@@ -89,6 +96,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5a1d9a0..784abd6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -64,6 +64,8 @@ typedef struct ReorderBufferChange
/* The type of change. */
enum ReorderBufferChangeType action;
+ RepNodeId origin_id;
+
/*
* Context data for the change, which part of the union is valid depends
* on action/action_internal.
@@ -162,6 +164,10 @@ typedef struct ReorderBufferTXN
*/
XLogRecPtr restart_decoding_lsn;
+ /* origin of the change that caused this transaction */
+ RepNodeId origin_id;
+ XLogRecPtr origin_lsn;
+
/*
* Commit time, only known when we read the actual commit record.
*/
@@ -335,7 +341,7 @@ void ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
void ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time);
+ TimestampTz commit_time, RepNodeId origin_id, XLogRecPtr origin_lsn);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
diff --git a/src/include/replication/replication_identifier.h b/src/include/replication/replication_identifier.h
new file mode 100644
index 0000000..36d74aa
--- /dev/null
+++ b/src/include/replication/replication_identifier.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ * replication_identifier.h
+ * XXX
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef REPLICATION_IDENTIFIER_H
+#define REPLICATION_IDENTIFIER_H
+
+#include "catalog/pg_replication_identifier.h"
+#include "replication/logical.h"
+
+#define InvalidRepNodeId 0
+#define DoNotReplicateRepNodeId USHRT_MAX
+
+extern PGDLLIMPORT RepNodeId replication_origin_id;
+extern PGDLLIMPORT XLogRecPtr replication_origin_lsn;
+extern PGDLLIMPORT TimestampTz replication_origin_timestamp;
+
+/* API for querying & manipulating replication identifiers */
+extern RepNodeId GetReplicationIdentifier(char *name, bool missing_ok);
+extern RepNodeId CreateReplicationIdentifier(char *name);
+extern void GetReplicationInfoByIdentifier(RepNodeId riident, bool missing_ok,
+ char **riname);
+extern void DropReplicationIdentifier(RepNodeId riident);
+
+extern void AdvanceReplicationIdentifier(RepNodeId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern void AdvanceCachedReplicationIdentifier(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern void SetupCachedReplicationIdentifier(RepNodeId node);
+extern void TeardownCachedReplicationIdentifier(void);
+extern XLogRecPtr RemoteCommitFromCachedReplicationIdentifier(void);
+
+/* crash recovery support */
+extern void CheckPointReplicationIdentifier(XLogRecPtr ckpt);
+extern void TruncateReplicationIdentifier(XLogRecPtr cutoff);
+extern void StartupReplicationIdentifier(XLogRecPtr ckpt);
+
+/* internals */
+extern Size ReplicationIdentifierShmemSize(void);
+extern void ReplicationIdentifierShmemInit(void);
+
+/* SQL callable functions */
+extern Datum pg_replication_identifier_get(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_create(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_drop(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_setup_replaying_from(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_reset_replaying_from(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_is_replaying(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_setup_tx_origin(PG_FUNCTION_ARGS);
+extern Datum pg_get_replication_identifier_progress(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_advance(PG_FUNCTION_ARGS);
+
+#endif
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index ba0b090..d7be45a 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,8 @@ enum SysCacheIdentifier
RANGETYPE,
RELNAMENSP,
RELOID,
+ REPLIDIDENT,
+ REPLIDREMOTE,
RULERELNAME,
STATRELATTINH,
TABLESPACEOID,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d50b103..5609503 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1390,6 +1390,11 @@ pg_prepared_xacts| SELECT p.transaction,
FROM ((pg_prepared_xact() p(transaction, gid, prepared, ownerid, dbid)
LEFT JOIN pg_authid u ON ((p.ownerid = u.oid)))
LEFT JOIN pg_database d ON ((p.dbid = d.oid)));
+pg_replication_identifier_progress| SELECT pg_get_replication_identifier_progress.local_id,
+ pg_get_replication_identifier_progress.external_id,
+ pg_get_replication_identifier_progress.remote_lsn,
+ pg_get_replication_identifier_progress.local_lsn
+ FROM pg_get_replication_identifier_progress() pg_get_replication_identifier_progress(local_id, external_id, remote_lsn, local_lsn);
pg_replication_slots| SELECT l.slot_name,
l.plugin,
l.slot_type,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index c7be273..400cba3 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -121,6 +121,7 @@ pg_pltemplate|t
pg_policy|t
pg_proc|t
pg_range|t
+pg_replication_identifier|t
pg_rewrite|t
pg_seclabel|t
pg_shdepend|t
--
2.0.0.rc2.4.g1dc51c6.dirty
On 02/16/2015 02:21 AM, Andres Freund wrote:
Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.
That makes sense.
How does this work if you have multiple replication systems in use in
the same cluster? You might use Slony to replication one table to one
system, and BDR to replication another table with another system. Or the
same replication software, but different hosts.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote:
On 02/16/2015 02:21 AM, Andres Freund wrote:
Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.That makes sense.
How does this work if you have multiple replication systems in use in the
same cluster? You might use Slony to replication one table to one system,
and BDR to replication another table with another system. Or the same
replication software, but different hosts.
It should "just work". Replication identifiers are identified by a free
form text, each replication solution can add the
information/distinguising data they need in there.
Bdr uses something like
#define BDR_NODE_ID_FORMAT "bdr_"UINT64_FORMAT"_%u_%u_%u_%s"
with
remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name
as parameters as a replication identifier name.
I've been wondering whether the bdr_ part in the above should be split
of into a separate field, similar to how the security label stuff does
it. But I don't think it'd really buy us much, especially as we did
not do that for logical slot names.
Each of the used replication solution would probably ask their output
plugin to only stream locally generated (i.e. origin_id =
InvalidRepNodeId) changes, and possibly from a defined list of other
known hosts in the cascading case.
Does that answer your question?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/16/2015 11:18 AM, Andres Freund wrote:
On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote:
How does this work if you have multiple replication systems in use in the
same cluster? You might use Slony to replication one table to one system,
and BDR to replication another table with another system. Or the same
replication software, but different hosts.It should "just work". Replication identifiers are identified by a free
form text, each replication solution can add the
information/distinguising data they need in there.Bdr uses something like
#define BDR_NODE_ID_FORMAT "bdr_"UINT64_FORMAT"_%u_%u_%u_%s"
with
remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name
as parameters as a replication identifier name.I've been wondering whether the bdr_ part in the above should be split
of into a separate field, similar to how the security label stuff does
it. But I don't think it'd really buy us much, especially as we did
not do that for logical slot names.Each of the used replication solution would probably ask their output
plugin to only stream locally generated (i.e. origin_id =
InvalidRepNodeId) changes, and possibly from a defined list of other
known hosts in the cascading case.Does that answer your question?
Yes, thanks. Note to self and everyone else looking at this: It's
important to keep in mind is that the replication IDs are completely
internal to the local cluster. They are *not* sent over the wire.
That's not completely true if you also use physical replication, though.
The replication IDs will be included in the WAL stream. Can you have
logical decoding running in a hot standby server? And how does the
progress report stuff that's checkpointed in pg_logical/checkpoints work
in a hot standby? (Sorry if I'm not making sense, I haven't really
thought hard about this myself)
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is
a small price to pay, to make it work more like everything else in the
system.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
Yes, thanks. Note to self and everyone else looking at this: It's important
to keep in mind is that the replication IDs are completely internal to the
local cluster. They are *not* sent over the wire.
Well, if you *want* to, you could send the external (free form text)
replication identifiers over the wire in the output plugin. There are
scenarios where that might make sense.
That's not completely true if you also use physical replication, though. The
replication IDs will be included in the WAL stream.
Right. But then a physical rep server isn't realy a different server.
Can you have logical decoding running in a hot standby server?
Not at the moment, there's some minor stuff missing (following
timelines, enforcing tuple visibility on the primary).
And how does the progress report stuff that's checkpointed in
pg_logical/checkpoints work in a hot standby? (Sorry if I'm not
making sense, I haven't really thought hard about this myself)
It doesn't work that greatly atm, they'd need to be WAL logged for that
- which would not be hard. It'd be better to include the information in
the checkpoint, but that unfortunately doesn't really work, because we
store the checkpoint in the control file. And that has to be <=
512 bytes.
What, I guess, we could do is log it in the checkpoint, after
determining the redo pointer, and store the LSN for it in the checkpoint
record proper. That'd mean we'd read one more record at startup, but
that isn't particularly bad.
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.
I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 16/02/15 10:46, Andres Freund wrote:
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.
I agree that limiting the overhead is important.
But I have related though about this - I wonder if it's worth to try to
map this more directly to the CommitTsNodeId. I mean it is currently
using that for the actual storage, but I am thinking of having the
Replication Identifiers be the public API and the CommitTsNodeId stuff
be just hidden implementation detail. This thought is based on the fact
that CommitTsNodeId provides somewhat meaningless number while
Replication Identifiers give us nice name for that number. And if we do
that then the type should perhaps be same for both?
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Now that the issue with padding seems to no longer exists since the
patch works both with and without padding, I went through the code and
here are some comments I have (in no particular order).
In CheckPointReplicationIdentifier:
+ * FIXME: Add a CRC32 to the end.
The function already does it (I guess you forgot to remove the comment).
Using max_replication_slots as limit for replication_identifier states
does not really make sense to me as replication_identifiers track remote
info while and slots are local and in case of master-slave replication
you need replication identifiers but don't need slots.
In bootstrap.c:
#define MARKNOTNULL(att) \ ((att)->attlen > 0 || \ (att)->atttypid == OIDVECTOROID || \ - (att)->atttypid == INT2VECTOROID) + (att)->atttypid == INT2VECTOROID || \ + strcmp(NameStr((att)->attname), "riname") == 0 \ + )
Huh? Can this be solved in a nicer way?
Since we call XLogFlush with local_lsn as parameter, shouldn't we check
that it's actually within valid range?
Currently we'll get errors like this if set to invalid value:
ERROR: xlog flush request 123/123 is not satisfied --- flushed only to
0/168FB18
In AdvanceReplicationIndentifier:
+ /* + * XXX: should we restore into a hashtable and dump into shmem only after + * recovery finished? + */
Probably no given that the function is also callable via SQL interface.
As I wrote in another email, I would like to integrate the RepNodeId and
CommitTSNodeId into single thing.
There are no docs for the new sql interfaces.
The replication_identifier.c might deserve some intro/notes text.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-02-22 04:59:30 +0100, Petr Jelinek wrote:
Now that the issue with padding seems to no longer exists since the patch
works both with and without padding, I went through the code and here are
some comments I have (in no particular order).In CheckPointReplicationIdentifier:
+ * FIXME: Add a CRC32 to the end.
The function already does it (I guess you forgot to remove the comment).
Yep. I locally have a WIP version that's much cleaned up and doesn't
contain it anymore.
Using max_replication_slots as limit for replication_identifier states does
not really make sense to me as replication_identifiers track remote info
while and slots are local and in case of master-slave replication you need
replication identifiers but don't need slots.
On the other hand, it's quite cheap if unused. Not sure if several
variables are worth it.
In bootstrap.c:
#define MARKNOTNULL(att) \ ((att)->attlen > 0 || \ (att)->atttypid == OIDVECTOROID || \ - (att)->atttypid == INT2VECTOROID) + (att)->atttypid == INT2VECTOROID || \ + strcmp(NameStr((att)->attname), "riname") == 0 \ + )Huh? Can this be solved in a nicer way?
Yes. I'd mentioned that this is just a temporary hack ;). I've since
pushed a more proper solution; BKI_FORCE_NOT_NULL can now be specified
on the column.
Since we call XLogFlush with local_lsn as parameter, shouldn't we check that
it's actually within valid range?
Currently we'll get errors like this if set to invalid value:
ERROR: xlog flush request 123/123 is not satisfied --- flushed only to
0/168FB18
I think we should rather remove local_lsn from all parameters that are
user callable, adding them there was a mistake. It's really only
relevant for the cases where it's called by commit.
In AdvanceReplicationIndentifier:
+ /* + * XXX: should we restore into a hashtable and dump into shmem only after + * recovery finished? + */Probably no given that the function is also callable via SQL interface.
Well, it's still a good idea regardless...
As I wrote in another email, I would like to integrate the RepNodeId and
CommitTSNodeId into single thing.
Will reply separately there.
There are no docs for the new sql interfaces.
Yea. The whole thing needs docs.
Thanks,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:
On 16/02/15 10:46, Andres Freund wrote:
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.I agree that limiting the overhead is important.
But I have related though about this - I wonder if it's worth to try to map
this more directly to the CommitTsNodeId.
Maybe. I'd rather go the other way round though;
replication_identifier.c/h's stuff seems much more generic than
CommitTsNodeId.
I mean it is currently using that
for the actual storage, but I am thinking of having the Replication
Identifiers be the public API and the CommitTsNodeId stuff be just hidden
implementation detail. This thought is based on the fact that CommitTsNodeId
provides somewhat meaningless number while Replication Identifiers give us
nice name for that number. And if we do that then the type should perhaps be
same for both?
I'm not sure. Given that this is included in a significant number of
recordsd I'd really rather not increase the overhead as described
above. Maybe we can just limit CommitTsNodeId to the same size for now?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 22/02/15 09:57, Andres Freund wrote:
On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:
On 16/02/15 10:46, Andres Freund wrote:
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.I agree that limiting the overhead is important.
But I have related though about this - I wonder if it's worth to try to map
this more directly to the CommitTsNodeId.Maybe. I'd rather go the other way round though;
replication_identifier.c/h's stuff seems much more generic than
CommitTsNodeId.
Probably not more generic but definitely nicer as the nodes are named
and yes it has richer API.
I mean it is currently using that
for the actual storage, but I am thinking of having the Replication
Identifiers be the public API and the CommitTsNodeId stuff be just hidden
implementation detail. This thought is based on the fact that CommitTsNodeId
provides somewhat meaningless number while Replication Identifiers give us
nice name for that number. And if we do that then the type should perhaps be
same for both?I'm not sure. Given that this is included in a significant number of
recordsd I'd really rather not increase the overhead as described
above. Maybe we can just limit CommitTsNodeId to the same size for now?
That would make sense.
I also wonder about the default nodeid infrastructure the committs
provides. I can't think of use-case which it can be used for and which
isn't solved by replication identifiers - in the end the main reason I
added that infra was to make it possible to write something like
replication identifiers as part of an extension. In any case I don't
think the default nodeid can be used in parallel with replication
identifiers since one will overwrite the SLRU record for the other.
Maybe it's enough if this is documented but I think it might be better
if this patch removed that default committs nodeid infrastructure. It's
just few lines of code which nobody should be using yet.
Thinking about this some more and rereading the code I see that you are
setting TransactionTreeSetCommitTsData during xlog replay, but not
during transaction commit, that does not seem correct as the local
records will not have any nodeid/origin.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2/15/15 7:24 PM, Andres Freund wrote:
On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
Here's my next attept attempt at producing something we can agree
upon.The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.
I'm trying to figure out what this feature is supposed to do, but the
patch contains no documentation or a commit message. Where is one
supposed to start?
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-02-22 10:03:59 -0500, Peter Eisentraut wrote:
On 2/15/15 7:24 PM, Andres Freund wrote:
On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
Here's my next attept attempt at producing something we can agree
upon.The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.I'm trying to figure out what this feature is supposed to do, but the
patch contains no documentation or a commit message. Where is one
supposed to start?
For a relatively short summary:
http://archives.postgresql.org/message-id/20150216002155.GI15326%40awork2.anarazel.de
For a longer one:
/messages/by-id/20140923182422.GA15776@alap3.anarazel.de
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
Here's the next version of this patch. I've tried to address the biggest
issue (documentation) and some more. Now that both the more flexible
commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it
looks much cleaner.
Changes:
* Loads of documentation and comments
* Revamped locking strategy. There's now a LWLock protecting all the
replication progress array and spinlock for the individual sltos.
* Simpler checkpoint format.
* A new pg_replication_identifier_progress() function returning a
individual identifier's replication progress; previously there was
only the view showing all of them.
* Lots of minor cleanup
* Some more tests
I'd greatly appreciate some feedback on the documentation. I'm not
entirely sure into how much detail to go; and where exactly in the docs
to place it. I do wonder if we shouldn't merge this with the logical
decoding section and whether we could also document commit timestamps
somewhere in there.
I've verified that this correctly works on a stanby; replication
progress is replicated correctly. I think there's two holes though:
Changing the replication progress without replicating anything and
dropping a replication identifier with some replication progress might
not work correctly. That's fairly easily fixed and I intend to do so.
Other than that I'm not aware of outstanding issues.
Comments?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Introduce-replication-identifiers-v1.0.patchtext/x-patch; charset=us-asciiDownload
>From 32ff44f3f045b0805301af8d94d1f11336c6ce8f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 15 Mar 2015 18:11:01 +0100
Subject: [PATCH] Introduce replication identifiers: v1.0
Replication identifiers are used to identify nodes in a replication
setup, identify changes that are created due to replication and to keep
track of replication progress.
Primarily this is useful because solving these in other ways is
possible, but ends up being much less efficient and more complicated. We
don't want to require replication solutions to reimplement logic for
this independently. The infrastructure is intended to be generic enough
to be reusable.
This infrastructure replaces the 'nodeid' infrastructure of commit
timestamps. Except that there's only 2^16 identifiers, the
infrastructure provided here integrates with logical replication and is
available via SQL. Since the commit timestamp infrastructure has also
been introduced in 9.5 that's not a problem.
For now the number of nodes whose replication progress can be tracked is
determined by the max_replication_slots GUC. It's not perfect to reuse
that GUC, but there doesn't seem to be sufficient reason to introduce a
separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Robert Haas, Heikki Linnakangas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
---
contrib/test_decoding/Makefile | 3 +-
contrib/test_decoding/expected/replident.out | 127 ++
contrib/test_decoding/sql/replident.sql | 58 +
contrib/test_decoding/test_decoding.c | 28 +
doc/src/sgml/catalogs.sgml | 124 ++
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/func.sgml | 162 ++-
doc/src/sgml/logicaldecoding.sgml | 35 +-
doc/src/sgml/postgres.sgml | 1 +
doc/src/sgml/replication-identifiers.sgml | 89 ++
src/backend/access/rmgrdesc/xactdesc.c | 24 +-
src/backend/access/transam/commit_ts.c | 53 +-
src/backend/access/transam/xact.c | 69 +-
src/backend/access/transam/xlog.c | 8 +
src/backend/access/transam/xloginsert.c | 22 +-
src/backend/access/transam/xlogreader.c | 10 +
src/backend/catalog/Makefile | 2 +-
src/backend/catalog/catalog.c | 8 +-
src/backend/catalog/system_views.sql | 7 +
src/backend/replication/logical/Makefile | 3 +-
src/backend/replication/logical/decode.c | 48 +-
src/backend/replication/logical/logical.c | 33 +
src/backend/replication/logical/reorderbuffer.c | 5 +-
.../replication/logical/replication_identifier.c | 1300 ++++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/cache/syscache.c | 23 +
src/bin/pg_resetxlog/pg_resetxlog.c | 5 +
src/include/access/commit_ts.h | 14 +-
src/include/access/xact.h | 11 +
src/include/access/xlog.h | 1 +
src/include/access/xlog_internal.h | 2 +-
src/include/access/xlogdefs.h | 6 +
src/include/access/xlogreader.h | 9 +
src/include/access/xlogrecord.h | 7 +
src/include/catalog/catversion.h | 2 +-
src/include/catalog/indexing.h | 6 +
src/include/catalog/pg_proc.h | 30 +
src/include/catalog/pg_replication_identifier.h | 74 ++
src/include/pg_config_manual.h | 6 +
src/include/replication/logical.h | 2 +
src/include/replication/output_plugin.h | 8 +
src/include/replication/reorderbuffer.h | 8 +-
src/include/replication/replication_identifier.h | 62 +
src/include/storage/lwlock.h | 3 +-
src/include/utils/syscache.h | 2 +
src/test/regress/expected/rules.out | 5 +
src/test/regress/expected/sanity_check.out | 1 +
47 files changed, 2425 insertions(+), 85 deletions(-)
create mode 100644 contrib/test_decoding/expected/replident.out
create mode 100644 contrib/test_decoding/sql/replident.sql
create mode 100644 doc/src/sgml/replication-identifiers.sgml
create mode 100644 src/backend/replication/logical/replication_identifier.c
create mode 100644 src/include/catalog/pg_replication_identifier.h
create mode 100644 src/include/replication/replication_identifier.h
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 438be44..f8334cc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -37,7 +37,8 @@ submake-isolation:
submake-test_decoding:
$(MAKE) -C $(top_builddir)/contrib/test_decoding
-REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel binary prepared
+REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \
+ binary prepared replident
regresscheck: all | submake-regress submake-test_decoding
$(MKDIR_P) regression_output
diff --git a/contrib/test_decoding/expected/replident.out b/contrib/test_decoding/expected/replident.out
new file mode 100644
index 0000000..f6dc404
--- /dev/null
+++ b/contrib/test_decoding/expected/replident.out
@@ -0,0 +1,127 @@
+-- predictability
+SET synchronous_commit = on;
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+ pg_replication_identifier_create
+----------------------------------
+ 1
+(1 row)
+
+-- ensure duplicate creations fail
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+ERROR: duplicate key value violates unique constraint "pg_replication_identifier_riname_index"
+DETAIL: Key (riname)=(test_decoding: regression_slot) already exists.
+--ensure deletions work (once)
+SELECT pg_replication_identifier_create('test_decoding: temp');
+ pg_replication_identifier_create
+----------------------------------
+ 2
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+ pg_replication_identifier_drop
+--------------------------------
+
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+ERROR: cache lookup failed for replication identifier named test_decoding: temp
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ BEGIN
+ table public.target_tbl: INSERT: id[integer]:1 data[text]:'BEGIN'
+ table public.target_tbl: INSERT: id[integer]:2 data[text]:'table public.origin_tbl: INSERT: id[integer]:1 data[text]:''will be replicated and decoded and decoded again'''
+ table public.target_tbl: INSERT: id[integer]:3 data[text]:'COMMIT'
+ COMMIT
+(5 rows)
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+-- mark session as replaying
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+ pg_replication_identifier_setup_replaying_from
+------------------------------------------------
+
+(1 row)
+
+-- ensure we prevent duplicate setup
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+ERROR: cannot setup replication origin when one is already setup
+BEGIN;
+-- setup transaction origins
+SELECT pg_replication_identifier_setup_tx_origin('0/ffffffff', '2013-01-01 00:00');
+ pg_replication_identifier_setup_tx_origin
+-------------------------------------------
+
+(1 row)
+
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+SELECT pg_replication_identifier_reset_replaying_from();
+ pg_replication_identifier_reset_replaying_from
+------------------------------------------------
+
+(1 row)
+
+SELECT local_id, external_id, remote_lsn, local_lsn <> '0/0' FROM pg_replication_identifier_progress;
+ local_id | external_id | remote_lsn | ?column?
+----------+--------------------------------+------------+----------
+ 1 | test_decoding: regression_slot | 0/FFFFFFFF | t
+(1 row)
+
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', false);
+ pg_replication_identifier_progress
+------------------------------------
+ 0/FFFFFFFF
+(1 row)
+
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', true);
+ pg_replication_identifier_progress
+------------------------------------
+ 0/FFFFFFFF
+(1 row)
+
+-- ensure reset requires previously setup state
+SELECT pg_replication_identifier_reset_replaying_from();
+ERROR: no replication identifier is set up
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+------
+(0 rows)
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+--------------------------------------------------------------------------------
+ BEGIN
+ table public.origin_tbl: INSERT: id[integer]:3 data[text]:'will be replicated'
+ COMMIT
+(3 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: regression_slot');
+ pg_replication_identifier_drop
+--------------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/replident.sql b/contrib/test_decoding/sql/replident.sql
new file mode 100644
index 0000000..d5ba486
--- /dev/null
+++ b/contrib/test_decoding/sql/replident.sql
@@ -0,0 +1,58 @@
+-- predictability
+SET synchronous_commit = on;
+
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+-- ensure duplicate creations fail
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+
+--ensure deletions work (once)
+SELECT pg_replication_identifier_create('test_decoding: temp');
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+
+-- mark session as replaying
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+
+-- ensure we prevent duplicate setup
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+
+BEGIN;
+-- setup transaction origins
+SELECT pg_replication_identifier_setup_tx_origin('0/ffffffff', '2013-01-01 00:00');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+
+SELECT pg_replication_identifier_reset_replaying_from();
+
+SELECT local_id, external_id, remote_lsn, local_lsn <> '0/0' FROM pg_replication_identifier_progress;
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', false);
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', true);
+
+-- ensure reset requires previously setup state
+SELECT pg_replication_identifier_reset_replaying_from();
+
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_replication_identifier_drop('test_decoding: regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 963d5df..2ec3001 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -21,6 +21,7 @@
#include "replication/output_plugin.h"
#include "replication/logical.h"
+#include "replication/replication_identifier.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -43,6 +44,7 @@ typedef struct
bool include_timestamp;
bool skip_empty_xacts;
bool xact_wrote_changes;
+ bool only_local;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +61,8 @@ static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
+static bool pg_decode_filter(LogicalDecodingContext *ctx,
+ RepNodeId origin_id);
void
_PG_init(void)
@@ -76,6 +80,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
}
@@ -97,6 +102,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_xids = true;
data->include_timestamp = false;
data->skip_empty_xacts = false;
+ data->only_local = false;
ctx->output_plugin_private = data;
@@ -155,6 +161,17 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "only-local") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->only_local = true;
+ else if (!parse_bool(strVal(elem->arg), &data->only_local))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -223,6 +240,17 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+static bool
+pg_decode_filter(LogicalDecodingContext *ctx,
+ RepNodeId origin_id)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->only_local && origin_id != InvalidRepNodeId)
+ return true;
+ return false;
+}
+
/*
* Print literal `outputstr' already represented as string of type `typid'
* into stringbuf `s'.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index dfed546..12a66c2 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -239,6 +239,16 @@
</row>
<row>
+ <entry><link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link></entry>
+ <entry>registered replication identifiers</entry>
+ </row>
+
+ <row>
+ <entry><link linkend="catalog-pg-replication-identifier-progress"><structname>pg_replication_identifier_progress</structname></link></entry>
+ <entry>information about logical replication progress</entry>
+ </row>
+
+ <row>
<entry><link linkend="catalog-pg-replication-slots"><structname>pg_replication_slots</structname></link></entry>
<entry>replication slot information</entry>
</row>
@@ -5323,6 +5333,120 @@
</sect1>
+ <sect1 id="catalog-pg-replication-identifier">
+ <title><structname>pg_replication_identifier</structname></title>
+
+ <indexterm zone="catalog-pg-replication-identifier">
+ <primary>pg_replication_identifier</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_replication_identifier</structname> catalog
+ contains all replication identifiers created. For more on
+ replication identifiers
+ see <xref linkend="replication-identifiers">.
+ </para>
+
+ <table>
+
+ <title><structname>pg_replication_identifier</structname> Columns</title>
+
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Type</entry>
+ <entry>References</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>riident</structfield></entry>
+ <entry><type>Oid</type></entry>
+ <entry></entry>
+ <entry>A unique, cluster-wide identifier for the replication
+ identifier. Should never leave the system.</entry>
+ </row>
+
+ <row>
+ <entry><structfield>riname</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry></entry>
+ <entry>The external, user defined, name of a replication
+ identifier.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="catalog-pg-replication-identifier-progress">
+ <title><structname>pg_replication_identifier_progress</structname></title>
+
+ <indexterm zone="catalog-pg-replication-identifier-progress">
+ <primary>pg_replication_identifier_progress</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_replication_identifier_progress</structname>
+ view contains information about how far replication for a certain
+ replication identifier has progressed. For more on replication
+ identifiers see <xref linkend="replication-identifiers">.
+ </para>
+
+ <table>
+
+ <title><structname>pg_replication_identifier_progress</structname> Columns</title>
+
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Type</entry>
+ <entry>References</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>local_id</structfield></entry>
+ <entry><type>Oid</type></entry>
+ <entry><literal><link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link>.riident</literal></entry>
+ <entry>internal node identifier</entry>
+ </row>
+
+ <row>
+ <entry><structfield>external_id</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry><literal><link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link>.riname</literal></entry>
+ <entry>external node identifier</entry>
+ </row>
+
+ <row>
+ <entry><structfield>remote_lsn</structfield></entry>
+ <entry><type>pg_lsn</type></entry>
+ <entry></entry>
+ <entry>The origin node's LSN up to which data has been replicated.</entry>
+ </row>
+
+
+ <row>
+ <entry><structfield>local_lsn</structfield></entry>
+ <entry><type>pg_lsn</type></entry>
+ <entry></entry>
+ <entry>This node's LSN that at
+ which <literal>remote_lsn</literal> has been replicated. Used to
+ flush commit records before persisting data to disk when using
+ asynchronous commits.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
<sect1 id="catalog-pg-replication-slots">
<title><structname>pg_replication_slots</structname></title>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 89fff77..3d7166a 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -95,6 +95,7 @@
<!ENTITY fdwhandler SYSTEM "fdwhandler.sgml">
<!ENTITY custom-scan SYSTEM "custom-scan.sgml">
<!ENTITY logicaldecoding SYSTEM "logicaldecoding.sgml">
+<!ENTITY replication-identifiers SYSTEM "replication-identifiers.sgml">
<!ENTITY protocol SYSTEM "protocol.sgml">
<!ENTITY sources SYSTEM "sources.sgml">
<!ENTITY storage SYSTEM "storage.sgml">
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index aa19e10..a4166f0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16875,9 +16875,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<para>
The functions shown in <xref linkend="functions-replication-table"> are
for controlling and interacting with replication features.
- See <xref linkend="streaming-replication">
- and <xref linkend="streaming-replication-slots"> for information about the
- underlying features. Use of these functions is restricted to superusers.
+ See <xref linkend="streaming-replication">,
+ <xref linkend="streaming-replication-slots">, <xref linkend="replication-identifiers">
+ for information about the underlying features. Use of these
+ functions is restricted to superusers.
</para>
<para>
@@ -17034,6 +17035,161 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
on future calls.
</entry>
</row>
+
+ <row id="replication-identifier-create">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_create</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_create(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ <parameter>internal_id</parameter> <type>oid</type>
+ </entry>
+ <entry>
+ Create a replication identifier based on the passed in
+ external name, and create an internal id for it.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_get</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_get(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ <parameter>internal_id</parameter> <type>oid</type>
+ </entry>
+ <entry>
+ Lookup replication identifier and return the internal id. If
+ no replication identifier is found a error is thrown.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-setup-replaying-from">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_setup_replaying_from</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_setup_replaying_from(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Signal that the current session is replaying from the passed
+ in node. This will mark changes and transactions emitted by
+ session to be marked as originating from that node. Normal
+ operation can be resumed using
+ <function>pg_replication_identifier_reset_replaying_from</function>. Can
+ only be used if no previous origin is configured.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_reset_replaying_from</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_reset_replaying_from(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Teardown configured replication identifier setup by
+ <function>pg_replication_identifier_setup_replaying_from</function>.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-setup-tx-origin">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_setup_tx_origin</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_setup_tx_origin(<parameter>origin_lsn</parameter> <type>pg_lsn</type>, <parameter>origin_timestamp</parameter> <type>timestamptz</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Mark the current transaction to be replication a transaction
+ that has committed at the passed in <acronym>LSN</acronym> and
+ timestamp. Can only be called when a replication origin has
+ previously been configured using
+ <function>pg_replication_identifier_setup_replaying_from</function>.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_is_replaying</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_is_replaying()</function></literal>
+ </entry>
+ <entry>
+ bool
+ </entry>
+ <entry>
+ Has a replication identifer been setup in the current session?
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_advance</primary>
+ </indexterm>
+ <literal>pg_replication_identifier_advance<function>(<parameter>node_name</parameter> <type>text</type>, <parameter>pos</parameter> <type>pg_lsn</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Set replication progress for the passed in node to the passed
+ in position. This primarily is useful for setting up the
+ initial position or a new position after configuration changes
+ and similar. Be aware that careless use of this function can
+ lead to inconsistently replicated data.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-drop">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_drop</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_drop(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Delete a previously created replication identifier.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-progress">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_progress</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_progress(<parameter>node_name</parameter> <type>text</type>, <parameter>flush</parameter> <type>bool</type>)</function></literal>
+ </entry>
+ <entry>
+ pg_lsn
+ </entry>
+ <entry>
+ Return the replay position for the passed in replication
+ identifier. The parameter <parameter>flush</parameter>
+ determines whether the corresponding local transaction will be
+ guaranteed to have been flushed to disk or not.
+ </entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3650567..c84a1769 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -363,6 +363,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -370,7 +371,8 @@ typedef void (*LogicalOutputPluginInit)(struct OutputPluginCallbacks *cb);
</programlisting>
The <function>begin_cb</function>, <function>change_cb</function>
and <function>commit_cb</function> callbacks are required,
- while <function>startup_cb</function>
+ while <function>startup_cb</function>,
+ <function>filter_by_origin_cb</function>
and <function>shutdown_cb</function> are optional.
</para>
</sect2>
@@ -569,6 +571,37 @@ typedef void (*LogicalDecodeChangeCB) (
</para>
</note>
</sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-by-origin">
+ <title>Origin Filter Callback</title>
+
+ <para>
+ The optional <function>filter_by_origin_cb</function> callback
+ is called to determine wheter data that has been replayed
+ from <parameter>origin_id</parameter> is of interest to the
+ output plugin.
+<programlisting>
+typedef bool (*LogicalDecodeChangeCB) (
+ struct LogicalDecodingContext *ctx,
+ RepNodeId origin_id
+);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. No information but the origin is
+ available. To signal that changes originating on the passed in
+ node are irrelevant, return true, causing them to be filtered
+ away; false otherwise. The other callbacks will not be called
+ for transactions and changes that have been filtered away.
+ </para>
+ <para>
+ This is useful when implementing cascading or multi directional
+ replication solutions. Filtering by the origin allows to
+ prevent replicating the same changes back and forth in such
+ setups. While transactions and changes also carry information
+ about the origin, filtering via this callback is noticeably
+ more efficient.
+ </para>
+ </sect3>
</sect2>
<sect2 id="logicaldecoding-output-plugin-output">
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index e378d69..5e2eacb 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -220,6 +220,7 @@
&spi;
&bgworker;
&logicaldecoding;
+ &replication-identifiers;
</part>
diff --git a/doc/src/sgml/replication-identifiers.sgml b/doc/src/sgml/replication-identifiers.sgml
new file mode 100644
index 0000000..707a4e5
--- /dev/null
+++ b/doc/src/sgml/replication-identifiers.sgml
@@ -0,0 +1,89 @@
+<!-- doc/src/sgml/replication-identifiers.sgml -->
+<chapter id="replication-identifiers">
+ <title>Replication Identifiers</title>
+ <indexterm zone="replication-identifiers">
+ <primary>Replication Identifiers</primary>
+ </indexterm>
+
+ <para>
+ Replication identifiers are intended to make it easier to implement
+ logical replication solutions on top
+ of <xref linkend="logicaldecoding">. They provide a solution to two
+ common problems:
+ <itemizedlist>
+ <listitem><para>How to safely keep track of replication progress</para></listitem>
+ <listitem><para>How to change replication behavior, based on the
+ origin of a row; e.g. to avoid loops in bi-directional replication
+ setups</para></listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ Replication identifiers consist out of a external name, and a
+ internal identifier. The external identifier is free-form. It should
+ be used in a way that makes conflicts between replication
+ identifiers created by different replication solutions unlikely;
+ e.g. by prefixing the replication solution's name. The internal
+ identifier is used only to avoid having to store the long version in
+ situations where space efficiency is important. It should never be
+ shared between systems.
+ </para>
+
+ <para>
+ Replication identifiers can be created using the
+ <link linkend="replication-identifier-create"><function>pg_replication_identifier_create()</function></link>;
+ dropped using
+ <link linkend="replication-identifier-drop"><function>pg_replication_identifier_drop()</function></link>;
+ and seen in the
+ <link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link>
+ catalog.
+ </para>
+
+ <para>
+ When replicating from one system to another (independent of the fact
+ that those two might be in the same cluster, or even same database)
+ one nontrivial part of building a replication solution is to keep
+ track of replication progress. When the applying process or the
+ whole cluster dies, it needs to be able to find out up to where data
+ has successfully been replicated. Naive solutions to this like
+ updating a row in a table for every replayed transaction have
+ problems like bloat.
+ </para>
+
+ <para>
+ Using the replication identifier infrastructure a session can be
+ marked as replaying from a remote node (using the
+ <link linkend="replication-identifier-setup-replaying-from"><function>pg_replication_identifier_setup_replaying_from()</function></link>
+ function. Additionally the <acronym>LSN</acronym> and commit
+ timestamp of every source transaction can be configured on a per
+ transaction basis using
+ <link linkend="replication-identifier-setup-tx-origin"><function>pg_replication_identifier_setup_tx_origin()</function></link>.
+ If that's done replication progress will be persist in a crash safe
+ manner. Replication progress for all replication identifiers can be
+ seen in the
+ <link linkend="catalog-pg-replication-identifier-progress">
+ <structname>pg_replication_progress</structname>
+ </link> view. A individual identifier's progress, e.g. when resuming
+ replication, can be acquired using
+ <link linkend="replication-identifier-progress"><function>pg_replication_identifier_progress()</function></link>
+ </para>
+
+ <para>
+ In more complex replication topologies than replication from exactly
+ one system to one other another problem can be that it's hard to
+ avoid replicating replicated rows again. That can lead both to
+ cycles in the replication and inefficiencies. Replication
+ identifiers provide a, optional, mechanism to recognize and prevent
+ that. When setup using the functions referenced in the previous
+ paragraph every change and transaction passed to output plugin
+ callbacks (see <xref linkend="logicaldecoding-output-plugin">)
+ generated by the session is tagged with the replication identifier
+ of the generating session. This allows to treat them differently in
+ the output plugin, e.g. ignoring all but locally originating rows.
+ Additionally the <link linkend="logicaldecoding-output-plugin-filter-by-origin">
+ <function>filter_by_origin_cb</function></link> callback can be used
+ to filter the logical decoding change stream based on the
+ source. While less flexible, filtering via that callback is
+ considerably more efficient.
+ </para>
+</chapter>
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index b036b6d..4df0bce 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -101,6 +101,16 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
data += sizeof(xl_xact_twophase);
}
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin *xl_origin = (xl_xact_origin *) data;
+
+ parsed->origin_lsn = xl_origin->origin_lsn;
+ parsed->origin_timestamp = xl_origin->origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
+ }
}
void
@@ -156,7 +166,7 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
}
static void
-xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec)
+xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepNodeId origin_id)
{
xl_xact_parsed_commit parsed;
int i;
@@ -218,6 +228,15 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec)
if (XactCompletionForceSyncCommit(parsed.xinfo))
appendStringInfo(buf, "; sync");
+
+ if (parsed.xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ appendStringInfo(buf, "; origin: node %u, lsn %X/%X, at %s",
+ origin_id,
+ (uint32)(parsed.origin_lsn >> 32),
+ (uint32)parsed.origin_lsn,
+ timestamptz_to_str(parsed.origin_timestamp));
+ }
}
static void
@@ -274,7 +293,8 @@ xact_desc(StringInfo buf, XLogReaderState *record)
{
xl_xact_commit *xlrec = (xl_xact_commit *) rec;
- xact_desc_commit(buf, XLogRecGetInfo(record), xlrec);
+ xact_desc_commit(buf, XLogRecGetInfo(record), xlrec,
+ XLogRecGetOrigin(record));
}
else if (info == XLOG_XACT_ABORT || info == XLOG_XACT_ABORT_PREPARED)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index dc23ab2..ffc3466 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -49,18 +49,18 @@
*/
/*
- * We need 8+4 bytes per xact. Note that enlarging this struct might mean
+ * We need 8+2 bytes per xact. Note that enlarging this struct might mean
* the largest possible file name is more than 5 chars long; see
* SlruScanDirectory.
*/
typedef struct CommitTimestampEntry
{
TimestampTz time;
- CommitTsNodeId nodeid;
+ RepNodeId nodeid;
} CommitTimestampEntry;
#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
- sizeof(CommitTsNodeId))
+ sizeof(RepNodeId))
#define COMMIT_TS_XACTS_PER_PAGE \
(BLCKSZ / SizeOfCommitTimestampEntry)
@@ -93,43 +93,18 @@ CommitTimestampShared *commitTsShared;
/* GUC variable */
bool track_commit_timestamp;
-static CommitTsNodeId default_node_id = InvalidCommitTsNodeId;
-
static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz ts,
- CommitTsNodeId nodeid, int pageno);
+ RepNodeId nodeid, int pageno);
static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
- CommitTsNodeId nodeid, int slotno);
+ RepNodeId nodeid, int slotno);
static int ZeroCommitTsPage(int pageno, bool writeXlog);
static bool CommitTsPagePrecedes(int page1, int page2);
static void WriteZeroPageXlogRec(int pageno);
static void WriteTruncateXlogRec(int pageno);
static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid);
-
-
-/*
- * CommitTsSetDefaultNodeId
- *
- * Set default nodeid for current backend.
- */
-void
-CommitTsSetDefaultNodeId(CommitTsNodeId nodeid)
-{
- default_node_id = nodeid;
-}
-
-/*
- * CommitTsGetDefaultNodeId
- *
- * Set default nodeid for current backend.
- */
-CommitTsNodeId
-CommitTsGetDefaultNodeId(void)
-{
- return default_node_id;
-}
+ RepNodeId nodeid);
/*
* TransactionTreeSetCommitTsData
@@ -156,7 +131,7 @@ CommitTsGetDefaultNodeId(void)
void
TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid, bool do_xlog)
+ RepNodeId nodeid, bool do_xlog)
{
int i;
TransactionId headxid;
@@ -234,7 +209,7 @@ TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
static void
SetXidCommitTsInPage(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz ts,
- CommitTsNodeId nodeid, int pageno)
+ RepNodeId nodeid, int pageno)
{
int slotno;
int i;
@@ -259,7 +234,7 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
- CommitTsNodeId nodeid, int slotno)
+ RepNodeId nodeid, int slotno)
{
int entryno = TransactionIdToCTsEntry(xid);
CommitTimestampEntry entry;
@@ -282,7 +257,7 @@ TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
*/
bool
TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
- CommitTsNodeId *nodeid)
+ RepNodeId *nodeid)
{
int pageno = TransactionIdToCTsPage(xid);
int entryno = TransactionIdToCTsEntry(xid);
@@ -322,7 +297,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (ts)
*ts = 0;
if (nodeid)
- *nodeid = InvalidCommitTsNodeId;
+ *nodeid = InvalidRepNodeId;
return false;
}
@@ -373,7 +348,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
* as NULL if not wanted.
*/
TransactionId
-GetLatestCommitTsData(TimestampTz *ts, CommitTsNodeId *nodeid)
+GetLatestCommitTsData(TimestampTz *ts, RepNodeId *nodeid)
{
TransactionId xid;
@@ -503,7 +478,7 @@ CommitTsShmemInit(void)
commitTsShared->xidLastCommit = InvalidTransactionId;
TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
- commitTsShared->dataLastCommit.nodeid = InvalidCommitTsNodeId;
+ commitTsShared->dataLastCommit.nodeid = InvalidRepNodeId;
}
else
Assert(found);
@@ -857,7 +832,7 @@ WriteTruncateXlogRec(int pageno)
static void
WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid)
+ RepNodeId nodeid)
{
xl_commit_ts_set record;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1495bb4..ba3fe09 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -40,8 +40,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/logical.h"
#include "replication/walsender.h"
#include "replication/syncrep.h"
+#include "replication/replication_identifier.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -1073,21 +1075,22 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
InvalidTransactionId /* plain commit */);
- }
- /*
- * We only need to log the commit timestamp separately if the node
- * identifier is a valid value; the commit record above already contains
- * the timestamp info otherwise, and will be used to load it.
- */
- if (markXidCommitted)
- {
- CommitTsNodeId node_id;
+ /* record plain commit ts if not replaying remote actions */
+ if (replication_origin_id == InvalidRepNodeId ||
+ replication_origin_id == DoNotReplicateRepNodeId)
+ replication_origin_timestamp = xactStopTimestamp;
+ else
+ AdvanceCachedReplicationIdentifier(replication_origin_lsn,
+ XactLastRecEnd);
- node_id = CommitTsGetDefaultNodeId();
+ /*
+ * We don't need to WAL log here, the commit record contains all the
+ * necessary information and will redo the SET action during replay.
+ */
TransactionTreeSetCommitTsData(xid, nchildren, children,
- xactStopTimestamp,
- node_id, node_id != InvalidCommitTsNodeId);
+ replication_origin_timestamp,
+ replication_origin_id, false);
}
/*
@@ -1176,9 +1179,11 @@ RecordTransactionCommit(void)
if (wrote_xlog && markXidCommitted)
SyncRepWaitForLSN(XactLastRecEnd);
+ /* remember end of last commit record */
+ XactLastCommitEnd = XactLastRecEnd;
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
-
cleanup:
/* Clean up local data */
if (rels)
@@ -4611,6 +4616,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_invals xl_invals;
xl_xact_twophase xl_twophase;
+ xl_xact_origin xl_origin;
uint8 info;
@@ -4668,6 +4674,15 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_twophase.xid = twophase_xid;
}
+ /* dump transaction origin information */
+ if (replication_origin_id != InvalidRepNodeId)
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replication_origin_lsn;
+ xl_origin.origin_timestamp = replication_origin_timestamp;
+ }
+
if (xl_xinfo.xinfo != 0)
info |= XLOG_XACT_HAS_INFO;
@@ -4709,6 +4724,9 @@ XactLogCommitRecord(TimestampTz commit_time,
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -4806,10 +4824,12 @@ XactLogAbortRecord(TimestampTz abort_time,
static void
xact_redo_commit(xl_xact_parsed_commit *parsed,
TransactionId xid,
- XLogRecPtr lsn)
+ XLogRecPtr lsn,
+ RepNodeId origin_id)
{
TransactionId max_xid;
int i;
+ TimestampTz commit_time;
max_xid = TransactionIdLatest(xid, parsed->nsubxacts, parsed->subxacts);
@@ -4829,9 +4849,16 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
LWLockRelease(XidGenLock);
}
+ Assert(!!(parsed->xinfo & XACT_XINFO_HAS_ORIGIN) == (origin_id != InvalidRepNodeId));
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ commit_time = parsed->origin_timestamp;
+ else
+ commit_time = parsed->xact_time;
+
/* Set the transaction commit timestamp and metadata */
TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
- parsed->xact_time, InvalidCommitTsNodeId,
+ commit_time, origin_id,
false);
if (standbyState == STANDBY_DISABLED)
@@ -4892,6 +4919,14 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
StandbyReleaseLockTree(xid, 0, NULL);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ /* recover apply progress */
+ AdvanceReplicationIdentifier(origin_id,
+ parsed->origin_lsn,
+ lsn);
+ }
+
/* Make sure files supposed to be dropped are dropped */
if (parsed->nrels > 0)
{
@@ -5047,13 +5082,13 @@ xact_redo(XLogReaderState *record)
{
Assert(!TransactionIdIsValid(parsed.twophase_xid));
xact_redo_commit(&parsed, XLogRecGetXid(record),
- record->EndRecPtr);
+ record->EndRecPtr, XLogRecGetOrigin(record));
}
else
{
Assert(TransactionIdIsValid(parsed.twophase_xid));
xact_redo_commit(&parsed, parsed.twophase_xid,
- record->EndRecPtr);
+ record->EndRecPtr, XLogRecGetOrigin(record));
RemoveTwoPhaseFile(parsed.twophase_xid, false);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2c6ae12..fc9619e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "postmaster/startup.h"
#include "replication/logical.h"
#include "replication/slot.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
@@ -295,6 +296,7 @@ static TimeLineID curFileTLI;
static XLogRecPtr ProcLastRecPtr = InvalidXLogRecPtr;
XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
+XLogRecPtr XactLastCommitEnd = InvalidXLogRecPtr;
/*
* RedoRecPtr is this backend's local copy of the REDO record pointer
@@ -6123,6 +6125,11 @@ StartupXLOG(void)
StartupMultiXact();
/*
+ * Recover knowledge about replay progress of known replication partners.
+ */
+ StartupReplicationIdentifier();
+
+ /*
* Initialize unlogged LSN. On a clean shutdown, it's restored from the
* control file. On recovery, all unlogged relations are blown away, so
* the unlogged LSN counter can be reset too.
@@ -8287,6 +8294,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointBuffers(flags); /* performs all required fsyncs */
+ CheckPointReplicationIdentifier();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 88209c3..67e38e5 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -26,6 +26,7 @@
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "miscadmin.h"
+#include "replication/replication_identifier.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
#include "utils/memutils.h"
@@ -83,10 +84,16 @@ static uint32 mainrdata_len; /* total # of bytes in chain */
static XLogRecData hdr_rdt;
static char *hdr_scratch = NULL;
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define SizeOfXlogOrigin 0
+#else
+#define SizeOfXlogOrigin (sizeof(RepNodeId) + sizeof(XLR_BLOCK_ID_ORIGIN))
+#endif
+
#define HEADER_SCRATCH_SIZE \
(SizeOfXLogRecord + \
MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
- SizeOfXLogRecordDataHeaderLong)
+ SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
/*
* An array of XLogRecData structs, to hold registered data.
@@ -678,6 +685,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
scratch += sizeof(BlockNumber);
}
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ /* followed by the record's origin, if any */
+ if (replication_origin_id != InvalidRepNodeId)
+ {
+ *(scratch++) = XLR_BLOCK_ID_ORIGIN;
+ memcpy(scratch, &replication_origin_id, sizeof(replication_origin_id));
+ scratch += sizeof(replication_origin_id);
+ }
+#endif
+
/* followed by main data, if any */
if (mainrdata_len > 0)
{
@@ -723,6 +740,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
rechdr->xl_tot_len = total_len;
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+ rechdr->xl_origin_id = replication_origin_id;
+#endif
rechdr->xl_prev = InvalidXLogRecPtr;
rechdr->xl_crc = rdata_crc;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 4a51def..7368f8b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -21,6 +21,7 @@
#include "access/xlogreader.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
+#include "replication/replication_identifier.h"
static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -957,6 +958,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
ResetDecoder(state);
state->decoded_record = record;
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ state->record_origin = InvalidRepNodeId;
+#endif
ptr = (char *) record;
ptr += SizeOfXLogRecord;
@@ -991,6 +995,12 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
break; /* by convention, the main data fragment is
* always last */
}
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ else if (block_id == XLR_BLOCK_ID_ORIGIN)
+ {
+ COPY_HEADER_FIELD(&state->record_origin, sizeof(RepNodeId));
+ }
+#endif
else if (block_id <= XLR_MAX_BLOCK_ID)
{
/* XLogRecordBlockHeader */
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index a403c64..5b04550 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -39,7 +39,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
pg_ts_config.h pg_ts_config_map.h pg_ts_dict.h \
pg_ts_parser.h pg_ts_template.h pg_extension.h \
pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
- pg_foreign_table.h pg_policy.h \
+ pg_foreign_table.h pg_policy.h pg_replication_identifier.h \
pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \
toasting.h indexing.h \
)
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index e9d3cdc..00c4393 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -32,6 +32,7 @@
#include "catalog/pg_namespace.h"
#include "catalog/pg_pltemplate.h"
#include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_replication_identifier.h"
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
@@ -224,7 +225,8 @@ IsSharedRelation(Oid relationId)
relationId == SharedDependRelationId ||
relationId == SharedSecLabelRelationId ||
relationId == TableSpaceRelationId ||
- relationId == DbRoleSettingRelationId)
+ relationId == DbRoleSettingRelationId ||
+ relationId == ReplicationIdentifierRelationId)
return true;
/* These are their indexes (see indexing.h) */
if (relationId == AuthIdRolnameIndexId ||
@@ -240,7 +242,9 @@ IsSharedRelation(Oid relationId)
relationId == SharedSecLabelObjectIndexId ||
relationId == TablespaceOidIndexId ||
relationId == TablespaceNameIndexId ||
- relationId == DbRoleSettingDatidRolidIndexId)
+ relationId == DbRoleSettingDatidRolidIndexId ||
+ relationId == ReplicationLocalIdentIndex ||
+ relationId == ReplicationExternalIdentIndex)
return true;
/* These are their toast tables and toast indexes (see toasting.h) */
if (relationId == PgShdescriptionToastTable ||
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2800f73..9fd2908 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -766,6 +766,13 @@ CREATE VIEW pg_user_mappings AS
REVOKE ALL on pg_user_mapping FROM public;
+
+CREATE VIEW pg_replication_identifier_progress AS
+ SELECT *
+ FROM pg_get_replication_identifier_progress();
+
+REVOKE ALL ON pg_replication_identifier_progress FROM public;
+
--
-- We have a few function definitions in here, too.
-- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index 310a45c..95bcffb 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
-OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o snapbuild.o
+OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o replication_identifier.o \
+ snapbuild.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eb7293f..5003e59 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -40,6 +40,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "storage/standby.h"
@@ -422,6 +423,15 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
}
+static inline bool
+FilterByOrigin(LogicalDecodingContext *ctx, RepNodeId origin_id)
+{
+ if (ctx->callbacks.filter_by_origin_cb == NULL)
+ return false;
+
+ return filter_by_origin_cb_wrapper(ctx, origin_id);
+}
+
/*
* Consolidated commit record handling between the different form of commit
* records.
@@ -430,8 +440,17 @@ static void
DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid)
{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ XLogRecPtr commit_time = InvalidXLogRecPtr;
+ XLogRecPtr origin_id = InvalidRepNodeId;
int i;
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
@@ -452,12 +471,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* the reorderbuffer to forget the content of the (sub-)transactions
* if not.
*
- * There basically two reasons we might not be interested in this
+ * There can be several reasons we might not be interested in this
* transaction:
* 1) We might not be interested in decoding transactions up to this
* LSN. This can happen because we previously decoded it and now just
* are restarting or if we haven't assembled a consistent snapshot yet.
* 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -472,7 +492,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* ---
*/
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database))
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
{
for (i = 0; i < parsed->nsubxacts; i++)
{
@@ -492,7 +513,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
/* replay actions of all transaction + subtransactions in order */
ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- parsed->xact_time);
+ commit_time, origin_id, origin_lsn);
}
/*
@@ -537,8 +558,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -579,8 +605,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_UPDATE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -628,8 +659,13 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_DELETE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
@@ -673,6 +709,10 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (rnode.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
tupledata = XLogRecGetBlockData(r, 0, &tuplelen);
data = tupledata;
@@ -685,6 +725,8 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
+
memcpy(&change->data.tp.relnode, &rnode, sizeof(RelFileNode));
/*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 30baa45..fedd6f1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -39,6 +39,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "storage/proc.h"
@@ -46,6 +47,10 @@
#include "utils/memutils.h"
+RepNodeId replication_origin_id = InvalidRepNodeId; /* assumed identity */
+XLogRecPtr replication_origin_lsn;
+TimestampTz replication_origin_timestamp;
+
/* data for errcontext callback */
typedef struct LogicalErrorCallbackState
{
@@ -715,6 +720,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+bool
+filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepNodeId origin_id)
+{
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "shutdown";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_by_origin_cb(ctx, origin_id);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
/*
* Set the required catalog xmin horizon for historic snapshots in the current
* replication slot.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 20bb3b7..ae5d3af 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1255,7 +1255,8 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
void
ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time)
+ TimestampTz commit_time,
+ RepNodeId origin_id, XLogRecPtr origin_lsn)
{
ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
@@ -1273,6 +1274,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
/* serialize the last bunch of changes if we need start earlier anyway */
if (txn->nentries_mem != txn->nentries)
diff --git a/src/backend/replication/logical/replication_identifier.c b/src/backend/replication/logical/replication_identifier.c
new file mode 100644
index 0000000..d9adab8
--- /dev/null
+++ b/src/backend/replication/logical/replication_identifier.c
@@ -0,0 +1,1300 @@
+/*-------------------------------------------------------------------------
+ *
+ * replication_identifier.c
+ * Logical Replication Node Identifier and replication progress persistency
+ * support.
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/replication_identifier.c
+ *
+ * NOTES
+ *
+ * This file provides the following:
+ * * Interface functions for naming nodes in a replication setup
+ * * A facility to efficiently store and persist replication progress in a
+ * efficient and durable manner.
+ *
+ * Replication identifiers consist out of a descriptive, user defined,
+ * external name and a short, thus space efficient, internal 2 byte one. This
+ * split exists because replication identifiers have to be stored in WAL and
+ * shared memory and long descriptors would be inefficient. For now only use
+ * 2 bytes for the internal id of a replication identifier as it seems
+ * unlikely that there soon will be more than 65k nodes in one replication
+ * setup; and using only two bytes allow us to be more space efficient.
+ *
+ * Replication progress is tracked in a shared memory table
+ * (ReplicationStates) that's dumped to disk every checkpoint. Entries
+ * ('slots') in this table are identified by the internal id. That's the case
+ * because it allows to increase replication progress during crash
+ * recovery. To allow doing so we store the original LSN (from the originating
+ * system) of a transaction in the commit record. That allows to recover the
+ * precise replayed state after crash recovery; without requiring synchronous
+ * commits. Allowing logical replication to use asynchronous commit is
+ * generally good for performance, but especially important as it allows a
+ * single threaded replay process to keep up with a source that has multiple
+ * backends generating changes concurrently. For efficiency and simplicity
+ * reasons a backend can setup a replication identifier as its origin (a
+ * "cached replication identifier") that's from then on the source of changes
+ * produced by the backend, until reset again.
+ *
+ * This infrastructure is intended to be used in cooperation with logical
+ * decoding. When replaying from a remote system the configured origin is
+ * provided to output plugins, allowing filtering and such.
+ *
+ *
+ * There are several levels of locking at work:
+ *
+ * * To create and drop replication identifiers a exclusive lock on
+ * pg_replication_slot is required for the duration. That allows us to
+ * safely and conflict free assign new identifiers using a dirty snapshot.
+ *
+ * * When creating a in-memory replication progress slot the
+ * ReplicationIdentifier LWLock has to be held exclusively; when iterating
+ * over the replication progress a shared lock has to be held, the same when
+ * advancing the replication progress of a individual backend that has not
+ * setup as the backend's cached replication identifier.
+ *
+ * * When manipulating or looking at the remote_lsn and local_lsn fields of a
+ * replication progress slot that slot's spinlock has to be held. That's
+ * primarily because we do not assume 8 byte writes (the LSN) is atomic on
+ * all our platforms, but it also simplifies memory ordering concerns
+ * between the remote and local lsn.
+ *
+ * ---------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <unistd.h>
+#include <sys/stat.h>
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+
+#include "catalog/indexing.h"
+
+#include "nodes/execnodes.h"
+
+#include "replication/replication_identifier.h"
+#include "replication/logical.h"
+
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/copydir.h"
+#include "storage/spin.h"
+
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/pg_lsn.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+/*
+ * Replay progress of a single remote node.
+ */
+typedef struct ReplicationState
+{
+ /*
+ * Local identifier for the remote node.
+ */
+ RepNodeId local_identifier;
+
+ /*
+ * Location of the latest commit from the remote side.
+ */
+ XLogRecPtr remote_lsn;
+
+ /*
+ * Remember the local lsn of the commit record so we can XLogFlush() to it
+ * during a checkpoint so we know the commit record actually is safe on
+ * disk.
+ */
+ XLogRecPtr local_lsn;
+
+ /*
+ * Slot is setup in backend?
+ */
+ pid_t acquired_by;
+
+ /*
+ * Spinlock protecting remote_lsn and local_lsn.
+ */
+ slock_t mutex;
+} ReplicationState;
+
+/*
+ * On disk version of ReplicationState.
+ */
+typedef struct ReplicationStateOnDisk
+{
+ RepNodeId local_identifier;
+ XLogRecPtr remote_lsn;
+} ReplicationStateOnDisk;
+
+
+/*
+ * Base address into a shared memory array of replication states of size
+ * max_replication_slots.
+ *
+ * XXX: Should we use a separate variable to size this rather than
+ * max_replication_slots?
+ */
+static ReplicationState *ReplicationStates;
+
+/*
+ * Backend-local, cached element from ReplicationStates for use in a backend
+ * replaying remote commits, so we don't have to search ReplicationStates for
+ * the backends current RepNodeId.
+ */
+static ReplicationState *cached_replication_state = NULL;
+
+/* Magic for on disk files. */
+#define REPLICATION_STATE_MAGIC ((uint32)0x1257DADE)
+
+/* XXX: move to c.h? */
+#ifndef UINT16_MAX
+#define UINT16_MAX (0xFFFFU)
+#endif
+
+static void
+CheckReplicationIdentifierPrerequisites(bool check_slots)
+{
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("only superusers can query or manipulate replication identifiers")));
+
+ if (check_slots && max_replication_slots == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot query or manipulate replication identifiers when max_replication_slots = 0")));
+
+}
+
+
+/* ---------------------------------------------------------------------------
+ * Functions for working with replication identifiers themselves.
+ * ---------------------------------------------------------------------------
+ */
+
+/*
+ * Check for a persistent repication identifier identified by the replication
+ * identifier's external name..
+ *
+ * Returns InvalidOid if the node isn't known yet.
+ */
+RepNodeId
+GetReplicationIdentifier(char *riname, bool missing_ok)
+{
+ Form_pg_replication_identifier ident;
+ Oid riident = InvalidOid;
+ HeapTuple tuple;
+ Datum riname_d;
+
+ riname_d = CStringGetTextDatum(riname);
+
+ tuple = SearchSysCache1(REPLIDREMOTE, riname_d);
+ if (HeapTupleIsValid(tuple))
+ {
+ ident = (Form_pg_replication_identifier) GETSTRUCT(tuple);
+ riident = ident->riident;
+ ReleaseSysCache(tuple);
+ }
+ else if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication identifier named %s",
+ riname);
+
+ return riident;
+}
+
+/*
+ * Create a persistent replication identifier.
+ *
+ * Needs to be called in a transaction.
+ */
+RepNodeId
+CreateReplicationIdentifier(char *riname)
+{
+ Oid riident;
+ HeapTuple tuple = NULL;
+ Relation rel;
+ Datum riname_d;
+ SnapshotData SnapshotDirty;
+ SysScanDesc scan;
+ ScanKeyData key;
+
+ riname_d = CStringGetTextDatum(riname);
+
+ Assert(IsTransactionState());
+
+ /*
+ * We need the numeric replication identifiers to be 16bit wide, so we
+ * cannot rely on the normal oid allocation. So we simply scan
+ * pg_replication_identifier for the first unused id. That's not
+ * particularly efficient, but this should be an fairly infrequent
+ * operation - we can easily spend a bit more code on this when it turns
+ * out it needs to be faster.
+ *
+ * We handle concurrency by taking an exclusive lock (allowing reads!)
+ * over the table for the duration of the search. Because we use a "dirty
+ * snapshot" we can read rows that other in-progress sessions have
+ * written, even though they would be invisible with normal snapshots. Due
+ * to the exclusive lock there's no danger that new rows can appear while
+ * we're checking.
+ */
+ InitDirtySnapshot(SnapshotDirty);
+
+ rel = heap_open(ReplicationIdentifierRelationId, ExclusiveLock);
+
+ for (riident = InvalidOid + 1; riident < UINT16_MAX; riident++)
+ {
+ bool nulls[Natts_pg_replication_identifier];
+ Datum values[Natts_pg_replication_identifier];
+ bool collides;
+ CHECK_FOR_INTERRUPTS();
+
+ ScanKeyInit(&key,
+ Anum_pg_replication_riident,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(riident));
+
+ scan = systable_beginscan(rel, ReplicationLocalIdentIndex,
+ true /* indexOK */,
+ &SnapshotDirty,
+ 1, &key);
+
+ collides = HeapTupleIsValid(systable_getnext(scan));
+
+ systable_endscan(scan);
+
+ if (!collides)
+ {
+ /*
+ * Ok, found an unused riident, insert the new row and do a CCI,
+ * so our callers can look it up if they want to.
+ */
+ memset(&nulls, 0, sizeof(nulls));
+
+ values[Anum_pg_replication_riident -1] = ObjectIdGetDatum(riident);
+ values[Anum_pg_replication_riname - 1] = riname_d;
+
+ tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
+ simple_heap_insert(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple);
+ CommandCounterIncrement();
+ break;
+ }
+ }
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+
+ if (tuple == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("no free replication id could be found")));
+
+ heap_freetuple(tuple);
+ return riident;
+}
+
+
+/*
+ * Create a persistent replication identifier.
+ *
+ * Needs to be called in a transaction.
+ */
+void
+DropReplicationIdentifier(RepNodeId riident)
+{
+ HeapTuple tuple = NULL;
+ Relation rel;
+ int i;
+
+ Assert(IsTransactionState());
+
+ rel = heap_open(ReplicationIdentifierRelationId, ExclusiveLock);
+
+ /* cleanup the slot state info */
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state = &ReplicationStates[i];
+
+ /* found our slot */
+ if (state->local_identifier == riident)
+ {
+ if (state->acquired_by != 0)
+ {
+ elog(ERROR, "cannot drop slot that is setup in backend %d",
+ state->acquired_by);
+ }
+ /* reset entry */
+ state->local_identifier = InvalidRepNodeId;
+ state->remote_lsn = InvalidXLogRecPtr;
+ state->local_lsn = InvalidXLogRecPtr;
+ break;
+ }
+ }
+ LWLockRelease(ReplicationIdentifierLock);
+
+ tuple = SearchSysCache1(REPLIDIDENT, ObjectIdGetDatum(riident));
+ simple_heap_delete(rel, &tuple->t_self);
+ ReleaseSysCache(tuple);
+
+ CommandCounterIncrement();
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+}
+
+
+/*
+ * Lookup pg_replication_identifier via riident and return the external name.
+ *
+ * The external name is palloc'd in the calling context.
+ *
+ * Returns true if the identifier is known, false otherwise.
+ */
+bool
+GetReplicationInfoByIdentifier(RepNodeId riident, bool missing_ok, char **riname)
+{
+ HeapTuple tuple;
+ Form_pg_replication_identifier ric;
+
+ Assert(OidIsValid((Oid) riident));
+ Assert(riident != InvalidRepNodeId);
+ Assert(riident != DoNotReplicateRepNodeId);
+
+ tuple = SearchSysCache1(REPLIDIDENT,
+ ObjectIdGetDatum((Oid) riident));
+
+ if (HeapTupleIsValid(tuple))
+ {
+ ric = (Form_pg_replication_identifier) GETSTRUCT(tuple);
+ *riname = text_to_cstring(&ric->riname);
+ ReleaseSysCache(tuple);
+
+ return true;
+ }
+ else
+ {
+ *riname = NULL;
+
+ if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication identifier id: %u",
+ riident);
+
+ return false;
+ }
+}
+
+
+/* ---------------------------------------------------------------------------
+ * Functions for handling replication progress.
+ * ---------------------------------------------------------------------------
+ */
+
+Size
+ReplicationIdentifierShmemSize(void)
+{
+ Size size = 0;
+
+ /*
+ * FIXME: max_replication_slots is the wrong thing to use here, here we keep
+ * the replay state of *remote* transactions.
+ */
+ if (max_replication_slots == 0)
+ return size;
+
+ size = add_size(size,
+ mul_size(max_replication_slots, sizeof(ReplicationState)));
+ return size;
+}
+
+void
+ReplicationIdentifierShmemInit(void)
+{
+ bool found;
+
+ if (max_replication_slots == 0)
+ return;
+
+ ReplicationStates = (ReplicationState *)
+ ShmemInitStruct("ReplicationIdentifierState",
+ ReplicationIdentifierShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ int i;
+
+ MemSet(ReplicationStates, 0, ReplicationIdentifierShmemSize());
+
+ for (i = 0; i < max_replication_slots; i++)
+ SpinLockInit(&ReplicationStates[i].mutex);
+ }
+}
+
+/* ---------------------------------------------------------------------------
+ * Perform a checkpoint of replication identifier's progress with respect to
+ * the replayed remote_lsn. Make sure that all transactions we refer to in the
+ * checkpoint (local_lsn) are actually on-disk. This might not yet be the case
+ * if the transactions were originally committed asynchronously.
+ *
+ * We store checkpoints in the following format:
+ * +-------+------------------------+------------------+-----+--------+
+ * | MAGIC | ReplicationStateOnDisk | struct Replic... | ... | CRC32C | EOF
+ * +-------+------------------------+------------------+-----+--------+
+ *
+ * So its just the magic, followed by the statically sized
+ * ReplicationStateOnDisk structs. Note that the maximum number of
+ * ReplicationStates is determined by max_replication_slots.
+ * ---------------------------------------------------------------------------
+ */
+void
+CheckPointReplicationIdentifier(void)
+{
+ const char *tmppath = "pg_logical/replident_checkpoint.tmp";
+ const char *path = "pg_logical/replident_checkpoint";
+ int tmpfd;
+ int i;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ pg_crc32 crc;
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ /* make sure no old temp file is remaining */
+ if (unlink(tmppath) < 0 && errno != ENOENT)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ path)));
+
+ /*
+ * no other backend can perform this at the same time, we're protected by
+ * CheckpointLock.
+ */
+ tmpfd = OpenTransientFile((char *) tmppath,
+ O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
+ S_IRUSR | S_IWUSR);
+ if (tmpfd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m",
+ tmppath)));
+
+ /* write magic */
+ if ((write(tmpfd, &magic, sizeof(magic))) != sizeof(magic))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ /* prevent concurrent creations/drops */
+ LWLockAcquire(ReplicationIdentifierLock, LW_SHARED);
+
+ /* write actual data */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationStateOnDisk disk_state;
+ ReplicationState *curstate = &ReplicationStates[i];
+ XLogRecPtr local_lsn;
+
+ if (curstate->local_identifier == InvalidRepNodeId)
+ continue;
+
+ disk_state.local_identifier = curstate->local_identifier;
+
+ SpinLockAcquire(&curstate->mutex);
+ disk_state.remote_lsn = curstate->remote_lsn;
+ local_lsn = curstate->local_lsn;
+ SpinLockRelease(&curstate->mutex);
+
+ /* make sure we only write out a commit that's persistent */
+ XLogFlush(local_lsn);
+
+ if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
+ sizeof(disk_state))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+ }
+
+ LWLockRelease(ReplicationIdentifierLock);
+
+ /* write out the CRC */
+ FIN_CRC32C(crc);
+ if ((write(tmpfd, &crc, sizeof(crc))) != sizeof(crc))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+
+ /* fsync the temporary file */
+ if (pg_fsync(tmpfd) != 0)
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ tmppath)));
+ }
+
+ CloseTransientFile(tmpfd);
+
+ /* rename to permanent file, fsync file and directory */
+ if (rename(tmppath, path) != 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ tmppath, path)));
+ }
+
+ fsync_fname((char *) path, false);
+ fsync_fname("pg_logical", true);
+}
+
+/*
+ * Recover replication replay status from checkpoint data saved earlier by
+ * CheckPointReplicationIdentifier.
+ *
+ * This only needs to be called at startup and *not* during every checkpoint
+ * read during recovery (e.g. in HS or PITR from a base backup) afterwards. All
+ * state thereafter can be recovered by looking at commit records.
+ */
+void
+StartupReplicationIdentifier(void)
+{
+ const char *path = "pg_logical/replident_checkpoint";
+ int fd;
+ int readBytes;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ int last_state = 0;
+ pg_crc32 file_crc;
+ pg_crc32 crc;
+
+ /* don't want to overwrite already existing state */
+#ifdef USE_ASSERT_CHECKING
+ static bool already_started = false;
+ Assert(!already_started);
+ already_started = true;
+#endif
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ elog(LOG, "starting up replication identifiers");
+
+ fd = OpenTransientFile((char *) path, O_RDONLY | PG_BINARY, 0);
+
+ /*
+ * might have had max_replication_slots == 0 last run, or we just brought up a
+ * standby.
+ */
+ if (fd < 0 && errno == ENOENT)
+ return;
+ else if (fd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+
+ /* verify magic, thats written even if nothing was active */
+ readBytes = read(fd, &magic, sizeof(magic));
+ if (readBytes != sizeof(magic))
+ ereport(PANIC,
+ (errmsg("could not read file \"%s\": %m",
+ path)));
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ if (magic != REPLICATION_STATE_MAGIC)
+ ereport(PANIC,
+ (errmsg("replication checkpoint has wrong magic %u instead of %u",
+ magic, REPLICATION_STATE_MAGIC)));
+
+ /* we can skip locking here, no other access is possible */
+
+ /* recover individual states, until there are no more to be found */
+ while (true)
+ {
+ ReplicationStateOnDisk disk_state;
+
+ readBytes = read(fd, &disk_state, sizeof(disk_state));
+
+ /* no further data */
+ if (readBytes == sizeof(crc))
+ {
+ /* not pretty, but simple ... */
+ file_crc = *(pg_crc32*) &disk_state;
+ break;
+ }
+
+ if (readBytes < 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ path)));
+ }
+
+ if (readBytes != sizeof(disk_state))
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": read %d of %zu",
+ path, readBytes, sizeof(disk_state))));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+
+ if (last_state == max_replication_slots)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state could be found, increase max_replication_slots")));
+
+ /* copy data to shared memory */
+ ReplicationStates[last_state].local_identifier = disk_state.local_identifier;
+ ReplicationStates[last_state].remote_lsn = disk_state.remote_lsn;
+ last_state++;
+
+ elog(LOG, "recovered replication state of node %u to %X/%X",
+ disk_state.local_identifier,
+ (uint32)(disk_state.remote_lsn >> 32),
+ (uint32)disk_state.remote_lsn);
+ }
+
+ /* now check checksum */
+ FIN_CRC32C(crc);
+ if (file_crc != crc)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("replication_slot_checkpoint has wrong checksum %u, expected %u",
+ crc, file_crc)));
+
+ CloseTransientFile(fd);
+}
+
+/*
+ * Tell the replication identifier machinery that a commit from 'node' that
+ * originated at the LSN remote_commit on the remote node was replayed
+ * successfully and that we don't need to do so again. In combination with
+ * setting up replication_origin_lsn and replication_origin_id that ensures we
+ * won't loose knowledge about that after a crash if the the transaction had a
+ * persistent effect (think of asynchronous commits).
+ *
+ * local_commit needs to be a local LSN of the commit so that we can make sure
+ * uppon a checkpoint that enough WAL has been persisted to disk.
+ *
+ * Needs to be called with a RowExclusiveLock on pg_replication_identifier,
+ * unless running in recovery.
+ */
+void
+AdvanceReplicationIdentifier(RepNodeId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit)
+{
+ int i;
+ int free_slot = -1;
+ ReplicationState *replication_state = NULL;
+
+ Assert(node != InvalidRepNodeId);
+
+ /* we don't track DoNotReplicateRepNodeId */
+ if (node == DoNotReplicateRepNodeId)
+ return;
+
+ /*
+ * XXX: should we restore into a hashtable and dump into shmem only after
+ * recovery finished?
+ */
+
+ /* Lock exclusively, as we may have to create a new table entry. */
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ /*
+ * Search for either an existing slot for that identifier or a free one we
+ * can use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &ReplicationStates[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->local_identifier == InvalidRepNodeId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->local_identifier != node)
+ continue;
+
+ if (curstate->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer %d is already active for pid %d",
+ curstate->local_identifier, curstate->acquired_by)));
+ }
+
+ /* ok, found slot */
+ replication_state = curstate;
+ break;
+ }
+
+ if (replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state slot could be found for replication identifier %u",
+ node),
+ errhint("Increase max_replication_slots and try again.")));
+ else if (replication_state == NULL)
+ {
+ /* initialize new slot */
+ replication_state = &ReplicationStates[free_slot];
+ Assert(replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(replication_state->local_lsn == InvalidXLogRecPtr);
+ replication_state->local_identifier = node;
+ }
+
+ Assert(replication_state->local_identifier != InvalidRepNodeId);
+
+ /*
+ * Due to - harmless - race conditions during a checkpoint we could see
+ * values here that are older than the ones we already have in
+ * memory. Don't overwrite those.
+ */
+ SpinLockAcquire(&replication_state->mutex);
+ if (replication_state->remote_lsn < remote_commit)
+ replication_state->remote_lsn = remote_commit;
+ if (replication_state->local_lsn < local_commit)
+ replication_state->local_lsn = local_commit;
+ SpinLockRelease(&replication_state->mutex);
+
+ /*
+ * Release *after* changing the LSNs, slot isn't acquired and thus could
+ * otherwise be dropped anytime.
+ */
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+
+XLogRecPtr
+ReplicationIdentifierProgress(RepNodeId node, bool flush)
+{
+ int i;
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+
+ /* prevent slots from being concurrently dropped */
+ LWLockAcquire(ReplicationIdentifierLock, LW_SHARED);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+
+ state = &ReplicationStates[i];
+
+ if (state->local_identifier == node)
+ {
+ SpinLockAcquire(&state->mutex);
+ remote_lsn = state->remote_lsn;
+ local_lsn = state->local_lsn;
+ SpinLockRelease(&state->mutex);
+ break;
+ }
+ }
+
+ LWLockRelease(ReplicationIdentifierLock);
+
+ if (flush && local_lsn != InvalidXLogRecPtr)
+ XLogFlush(local_lsn);
+
+ return remote_lsn;
+}
+
+/*
+ * Tear down a (possibly) cached replication identifier during process exit.
+ */
+static void
+ReplicationIdentifierExitCleanup(int code, Datum arg)
+{
+
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ if (cached_replication_state != NULL &&
+ cached_replication_state->acquired_by == MyProcPid)
+ {
+ cached_replication_state->acquired_by = 0;
+ cached_replication_state = NULL;
+ }
+
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+/*
+ * Setup a replication identifier in the shared memory struct if it doesn't
+ * already exists and cache access to the specific ReplicationSlot so the
+ * array doesn't have to be searched when calling
+ * AdvanceCachedReplicationIdentifier().
+ *
+ * Obviously only one such cached identifier can exist per process and the
+ * current cached value can only be set again after the previous value is torn
+ * down with TeardownCachedReplicationIdentifier().
+ */
+void
+SetupCachedReplicationIdentifier(RepNodeId node)
+{
+ static bool registered_cleanup;
+ int i;
+ int free_slot = -1;
+
+ if (!registered_cleanup)
+ {
+ on_shmem_exit(ReplicationIdentifierExitCleanup, 0);
+ registered_cleanup = true;
+ }
+
+ Assert(max_replication_slots > 0);
+
+ if (cached_replication_state != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot setup replication origin when one is already setup")));
+
+ /* Lock exclusively, as we may have to create a new table entry. */
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ /*
+ * Search for either an existing slot for that identifier or a free one we
+ * can use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &ReplicationStates[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->local_identifier == InvalidRepNodeId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->local_identifier != node)
+ continue;
+
+ else if (curstate->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer %d is already active for pid %d",
+ curstate->local_identifier, curstate->acquired_by)));
+ }
+
+ /* ok, found slot */
+ cached_replication_state = curstate;
+ }
+
+
+ if (cached_replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state slot could be found for replication identifier %u",
+ node),
+ errhint("Increase max_replication_slots and try again.")));
+ else if (cached_replication_state == NULL)
+ {
+ /* initialize new slot */
+ cached_replication_state = &ReplicationStates[free_slot];
+ Assert(cached_replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(cached_replication_state->local_lsn == InvalidXLogRecPtr);
+ cached_replication_state->local_identifier = node;
+ }
+
+
+ Assert(cached_replication_state->local_identifier != InvalidRepNodeId);
+
+ cached_replication_state->acquired_by = MyProcPid;
+
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+/*
+ * Make currently cached replication identifier unavailable so a new one can
+ * be setup with SetupCachedReplicationIdentifier().
+ *
+ * This function may only be called if a previous identifier was setup with
+ * SetupCachedReplicationIdentifier().
+ */
+void
+TeardownCachedReplicationIdentifier(void)
+{
+ Assert(max_replication_slots != 0);
+
+ if (cached_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("no replication identifier is set up")));
+
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ cached_replication_state->acquired_by = 0;
+ cached_replication_state = NULL;
+
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+/*
+ * Do the same work AdvanceReplicationIdentifier() does, just on a pre-cached
+ * identifier. This is noticeably cheaper if you only ever work on a single
+ * replication identifier.
+ */
+void
+AdvanceCachedReplicationIdentifier(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit)
+{
+ Assert(cached_replication_state != NULL);
+ Assert(cached_replication_state->local_identifier != InvalidRepNodeId);
+
+ SpinLockAcquire(&cached_replication_state->mutex);
+ if (cached_replication_state->local_lsn < local_commit)
+ cached_replication_state->local_lsn = local_commit;
+ if (cached_replication_state->remote_lsn < remote_commit)
+ cached_replication_state->remote_lsn = remote_commit;
+ SpinLockRelease(&cached_replication_state->mutex);
+}
+
+/*
+ * Ask the machinery about the point up to which we successfully replayed
+ * changes from a already setup & cached replication identifier.
+ */
+XLogRecPtr
+CachedReplicationIdentifierProgress(void)
+{
+ XLogRecPtr remote_lsn;
+
+ Assert(cached_replication_state != NULL);
+
+ SpinLockAcquire(&cached_replication_state->mutex);
+ remote_lsn = cached_replication_state->remote_lsn;
+ SpinLockRelease(&cached_replication_state->mutex);
+
+ return remote_lsn;
+}
+
+
+
+/* ---------------------------------------------------------------------------
+ * SQL functions for working with replication identifiers.
+ *
+ * These mostly should be fairly short wrappers around more generic functions.
+ * ---------------------------------------------------------------------------
+ */
+
+/*
+ * Return the internal replication identifier for the passed in external one.
+ */
+Datum
+pg_replication_identifier_get(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ riident = GetReplicationIdentifier(name, true);
+
+ pfree(name);
+
+ if (OidIsValid(riident))
+ PG_RETURN_OID(riident);
+ PG_RETURN_NULL();
+}
+
+/*
+ * Create a replication identifier with the passed in name, and return the
+ * assigned internal identifier.
+ */
+Datum
+pg_replication_identifier_create(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ riident = CreateReplicationIdentifier(name);
+
+ pfree(name);
+
+ PG_RETURN_OID(riident);
+}
+
+/*
+ * Setup a cached replication identifier in the current session.
+ */
+Datum
+pg_replication_identifier_setup_replaying_from(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId origin;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ origin = GetReplicationIdentifier(name, false);
+ SetupCachedReplicationIdentifier(origin);
+
+ replication_origin_id = origin;
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_is_replaying(PG_FUNCTION_ARGS)
+{
+ CheckReplicationIdentifierPrerequisites(false);
+
+ PG_RETURN_BOOL(replication_origin_id != InvalidRepNodeId);
+}
+
+Datum
+pg_replication_identifier_reset_replaying_from(PG_FUNCTION_ARGS)
+{
+ CheckReplicationIdentifierPrerequisites(true);
+
+ TeardownCachedReplicationIdentifier();
+
+ replication_origin_id = InvalidRepNodeId;
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_setup_tx_origin(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr location = PG_GETARG_LSN(0);
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ if (cached_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("need to setup the origin id first")));
+
+ replication_origin_lsn = location;
+ replication_origin_timestamp = PG_GETARG_TIMESTAMPTZ(1);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_advance(PG_FUNCTION_ARGS)
+{
+ text *name = PG_GETARG_TEXT_P(0);
+ XLogRecPtr remote_commit = PG_GETARG_LSN(1);
+ RepNodeId node;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ /* lock to prevent the replication identifier from vanishing */
+ LockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ node = GetReplicationIdentifier(text_to_cstring(name), false);
+
+ /*
+ * Can't sensibly pass a local commit to be flushed at checkpoint - this
+ * xact hasn't committed yet. This is why this function should be used to
+ * set up the intial replication state, but not for replay.
+ */
+ AdvanceReplicationIdentifier(node, remote_commit, InvalidXLogRecPtr);
+
+ UnlockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_drop(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+
+ riident = GetReplicationIdentifier(name, false);
+ Assert(OidIsValid(riident));
+
+ DropReplicationIdentifier(riident);
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Return the replication progress for an individual replication identifier.
+ *
+ * If 'flush' is set to true it is ensured that the returned value corresponds
+ * to a local transaction that has been flushed. this is useful if asychronous
+ * commits are used when replaying replicated transactions.
+ */
+Datum
+pg_replication_identifier_progress(PG_FUNCTION_ARGS)
+{
+ char *name;
+ bool flush;
+ RepNodeId riident;
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ flush = PG_GETARG_BOOL(1);
+
+ riident = GetReplicationIdentifier(name, false);
+ Assert(OidIsValid(riident));
+
+ remote_lsn = ReplicationIdentifierProgress(riident, flush);
+
+ if (remote_lsn == InvalidXLogRecPtr)
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(remote_lsn);
+}
+
+
+Datum
+pg_get_replication_identifier_progress(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ int i;
+#define REPLICATION_IDENTIFIER_PROGRESS_COLS 4
+
+ /* we we want to return 0 rows if slot is set to zero */
+ CheckReplicationIdentifierPrerequisites(false);
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (tupdesc->natts != REPLICATION_IDENTIFIER_PROGRESS_COLS)
+ elog(ERROR, "wrong function definition");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* prevent slots from being concurrently dropped */
+ LWLockAcquire(ReplicationIdentifierLock, LW_SHARED);
+
+ /*
+ * Iterate through all possible ReplicationStates, display if they are
+ * filled. Note that we do not take any locks, so slightly corrupted/out
+ * of date values are a possibility.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+ Datum values[REPLICATION_IDENTIFIER_PROGRESS_COLS];
+ bool nulls[REPLICATION_IDENTIFIER_PROGRESS_COLS];
+ char *riname;
+
+ state = &ReplicationStates[i];
+
+ /* unused slot, nothing to display */
+ if (state->local_identifier == InvalidRepNodeId)
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[ 0] = ObjectIdGetDatum(state->local_identifier);
+
+ /*
+ * We're not preventing the identifier to be dropped concurrently, so
+ * silently accept that it might be gone.
+ */
+ if (!GetReplicationInfoByIdentifier(state->local_identifier, true,
+ &riname))
+ continue;
+
+ values[ 1] = CStringGetTextDatum(riname);
+
+ SpinLockAcquire(&state->mutex);
+
+ values[ 2] = LSNGetDatum(state->remote_lsn);
+
+ values[ 3] = LSNGetDatum(state->local_lsn);
+
+ SpinLockRelease(&state->mutex);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ tuplestore_donestoring(tupstore);
+
+ LWLockRelease(ReplicationIdentifierLock);
+
+#undef REPLICATION_IDENTIFIER_PROGRESS_COLS
+
+ return (Datum) 0;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 16b9808..e927698 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "replication/slot.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/replication_identifier.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
@@ -132,6 +133,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, CheckpointerShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
+ size = add_size(size, ReplicationIdentifierShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -238,6 +240,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
CheckpointerShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
+ ReplicationIdentifierShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index bd27168..fdccb95 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -54,6 +54,7 @@
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
+#include "catalog/pg_replication_identifier.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_tablespace.h"
#include "catalog/pg_ts_config.h"
@@ -620,6 +621,28 @@ static const struct cachedesc cacheinfo[] = {
},
128
},
+ {ReplicationIdentifierRelationId, /* REPLIDIDENT */
+ ReplicationLocalIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_riident,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
+ {ReplicationIdentifierRelationId, /* REPLIDREMOTE */
+ ReplicationExternalIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_riname,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
{RewriteRelationId, /* RULERELNAME */
RewriteRelRulenameIndexId,
2,
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index a16089f..8ae47a9 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -55,6 +55,8 @@
#include "common/fe_memutils.h"
#include "storage/large_object.h"
#include "pg_getopt.h"
+#include "replication/logical.h"
+#include "replication/replication_identifier.h"
static ControlFileData ControlFile; /* pg_control values */
@@ -1088,6 +1090,9 @@ WriteEmptyXLOG(void)
record->xl_tot_len = SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint);
record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
record->xl_rmid = RM_XLOG_ID;
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+ record->xl_origin_id = InvalidRepNodeId;
+#endif
recptr += SizeOfXLogRecord;
*(recptr++) = XLR_BLOCK_ID_DATA_SHORT;
*(recptr++) = sizeof(CheckPoint);
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 93d1217..578513d 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -13,6 +13,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
+#include "replication/replication_identifier.h"
#include "utils/guc.h"
@@ -21,18 +22,13 @@ extern PGDLLIMPORT bool track_commit_timestamp;
extern bool check_track_commit_timestamp(bool *newval, void **extra,
GucSource source);
-typedef uint32 CommitTsNodeId;
-#define InvalidCommitTsNodeId 0
-
-extern void CommitTsSetDefaultNodeId(CommitTsNodeId nodeid);
-extern CommitTsNodeId CommitTsGetDefaultNodeId(void);
extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid, bool do_xlog);
+ RepNodeId nodeid, bool do_xlog);
extern bool TransactionIdGetCommitTsData(TransactionId xid,
- TimestampTz *ts, CommitTsNodeId *nodeid);
+ TimestampTz *ts, RepNodeId *nodeid);
extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
- CommitTsNodeId *nodeid);
+ RepNodeId *nodeid);
extern Size CommitTsShmemBuffers(void);
extern Size CommitTsShmemSize(void);
@@ -58,7 +54,7 @@ extern void AdvanceOldestCommitTs(TransactionId oldestXact);
typedef struct xl_commit_ts_set
{
TimestampTz timestamp;
- CommitTsNodeId nodeid;
+ RepNodeId nodeid;
TransactionId mainxid;
/* subxact Xids follow */
} xl_commit_ts_set;
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fdf3ea3..9e78403 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -131,6 +131,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_RELFILENODES (1U << 2)
#define XACT_XINFO_HAS_INVALS (1U << 3)
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
+#define XACT_XINFO_HAS_ORIGIN (1U << 5)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -217,6 +218,12 @@ typedef struct xl_xact_twophase
} xl_xact_twophase;
#define MinSizeOfXactInvals offsetof(xl_xact_invals, msgs)
+typedef struct xl_xact_origin
+{
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_origin;
+
typedef struct xl_xact_commit
{
TimestampTz xact_time; /* time of commit */
@@ -227,6 +234,7 @@ typedef struct xl_xact_commit
/* xl_xact_relfilenodes follows if XINFO_HAS_RELFILENODES */
/* xl_xact_invals follows if XINFO_HAS_INVALS */
/* xl_xact_twophase follows if XINFO_HAS_TWOPHASE */
+ /* xl_xact_origin follows if XINFO_HAS_ORIGIN */
} xl_xact_commit;
#define MinSizeOfXactCommit (offsetof(xl_xact_commit, xact_time) + sizeof(TimestampTz))
@@ -267,6 +275,9 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
typedef struct xl_xact_parsed_abort
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2b1f423..f08b676 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -85,6 +85,7 @@ typedef enum
} RecoveryTargetType;
extern XLogRecPtr XactLastRecEnd;
+extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
extern bool reachedConsistency;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 12a1b61..95b00b6 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD083 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD085 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 6638c1d..bd8dd70 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -45,6 +45,12 @@ typedef uint64 XLogSegNo;
typedef uint32 TimeLineID;
/*
+ * Denotes the node on which the action causing a wal record to be logged
+ * originated on.
+ */
+typedef uint16 RepNodeId;
+
+/*
* Because O_DIRECT bypasses the kernel buffers, and because we never
* read those buffers except during crash recovery or if wal_level != minimal,
* it is a win to use it in all cases where we sync on each write(). We could
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 609bfe3..aa3e26e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -127,6 +127,10 @@ struct XLogReaderState
uint32 main_data_len; /* main data portion's length */
uint32 main_data_bufsz; /* allocated size of the buffer */
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ RepNodeId record_origin;
+#endif
+
/* information about blocks referenced by the record. */
DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
@@ -186,6 +190,11 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define XLogRecGetOrigin(decoder) ((decoder)->decoded_record->xl_origin_id)
+#else
+#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#endif
#define XLogRecGetData(decoder) ((decoder)->main_data)
#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 09bbcb1..507b90a 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -45,7 +45,11 @@ typedef struct XLogRecord
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+ RepNodeId xl_origin_id; /* what node did originally cause this record to be written */
+#else
/* 2 bytes of padding here, initialize to zero */
+#endif
pg_crc32 xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
@@ -212,5 +216,8 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define XLR_BLOCK_ID_ORIGIN 253
+#endif
#endif /* XLOGRECORD_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index da6035f..c5247a5 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201503191
+#define CATALOG_VERSION_NO 201503241
#endif
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a680229..405528d 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -305,6 +305,12 @@ DECLARE_UNIQUE_INDEX(pg_policy_oid_index, 3257, on pg_policy using btree(oid oid
DECLARE_UNIQUE_INDEX(pg_policy_polrelid_polname_index, 3258, on pg_policy using btree(polrelid oid_ops, polname name_ops));
#define PolicyPolrelidPolnameIndexId 3258
+DECLARE_UNIQUE_INDEX(pg_replication_identifier_riiident_index, 6001, on pg_replication_identifier using btree(riident oid_ops));
+#define ReplicationLocalIdentIndex 6001
+
+DECLARE_UNIQUE_INDEX(pg_replication_identifier_riname_index, 6002, on pg_replication_identifier using btree(riname varchar_pattern_ops));
+#define ReplicationExternalIdentIndex 6002
+
/* last step of initialization script: build the indexes declared above */
BUILD_INDICES
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 3c218a3..4e6b789 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5170,6 +5170,36 @@ DESCR("rank of hypothetical row without gaps");
DATA(insert OID = 3993 ( dense_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
DESCR("aggregate final function");
+/* replication_identifier.h */
+DATA(insert OID = 6003 ( pg_replication_identifier_create PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_create _null_ _null_ _null_ ));
+DESCR("create local replication identifier for the passed external one");
+
+DATA(insert OID = 6004 ( pg_replication_identifier_drop PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_drop _null_ _null_ _null_ ));
+DESCR("drop existing replication identifier");
+
+DATA(insert OID = 6005 ( pg_replication_identifier_get PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_get _null_ _null_ _null_ ));
+DESCR("translate the external node identifier to a local one");
+
+DATA(insert OID = 6006 ( pg_replication_identifier_setup_replaying_from PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_setup_replaying_from _null_ _null_ _null_ ));
+DESCR("setup from which node we are replaying transactions from currently");
+
+DATA(insert OID = 6007 ( pg_replication_identifier_reset_replaying_from PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_replication_identifier_reset_replaying_from _null_ _null_ _null_ ));
+DESCR("teardown configured replication identity");
+
+DATA(insert OID = 6008 ( pg_replication_identifier_setup_tx_origin PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ pg_replication_identifier_setup_tx_origin _null_ _null_ _null_ ));
+DESCR("setup transaction timestamp and origin lsn");
+
+DATA(insert OID = 6009 ( pg_replication_identifier_is_replaying PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_replication_identifier_is_replaying _null_ _null_ _null_ ));
+DESCR("is a replication identifier setup");
+
+DATA(insert OID = 6010 ( pg_replication_identifier_advance PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "25 3220" _null_ _null_ _null_ _null_ pg_replication_identifier_advance _null_ _null_ _null_ ));
+DESCR("advance replication itentifier to specific location");
+
+DATA(insert OID = 6011 ( pg_replication_identifier_progress PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 3220 "25 16" _null_ _null_ _null_ _null_ pg_replication_identifier_progress _null_ _null_ _null_ ));
+DESCR("get an individualreplication identifier's replication progress");
+
+DATA(insert OID = 6012 ( pg_get_replication_identifier_progress PGNSP PGUID 12 1 100 0 0 f f f f f t v 0 0 2249 "" "{26,25,3220,3220}" "{o,o,o,o}" "{local_id, external_id, remote_lsn, local_lsn}" _null_ pg_get_replication_identifier_progress _null_ _null_ _null_ ));
+DESCR("get progress for all replication identifiers");
/*
* Symbolic values for provolatile column: these indicate whether the result
diff --git a/src/include/catalog/pg_replication_identifier.h b/src/include/catalog/pg_replication_identifier.h
new file mode 100644
index 0000000..d72c839
--- /dev/null
+++ b/src/include/catalog/pg_replication_identifier.h
@@ -0,0 +1,74 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_replication_identifier.h
+ * Persistent Replication Node Identifiers
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/pg_replication_identifier.h
+ *
+ * NOTES
+ * the genbki.pl script reads this file and generates .bki
+ * information from the DATA() statements.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_REPLICATION_IDENTIFIER_H
+#define PG_REPLICATION_IDENTIFIER_H
+
+#include "catalog/genbki.h"
+#include "access/xlogdefs.h"
+
+/* ----------------
+ * pg_replication_identifier. cpp turns this into
+ * typedef struct FormData_pg_replication_identifier
+ * ----------------
+ */
+#define ReplicationIdentifierRelationId 6000
+
+CATALOG(pg_replication_identifier,6000) BKI_SHARED_RELATION BKI_WITHOUT_OIDS
+{
+ /*
+ * Locally known identifier that get included into WAL.
+ *
+ * This should never leave the system.
+ *
+ * Needs to fit into a uint16, so we don't waste too much space in WAL
+ * records. For this reason we don't use a normal Oid column here, since
+ * we need to handle allocation of new values manually.
+ */
+ Oid riident;
+
+ /*
+ * Variable-length fields start here, but we allow direct access to
+ * riname.
+ */
+
+ /* external, free-format, identifier */
+ text riname BKI_FORCE_NOT_NULL;
+#ifdef CATALOG_VARLEN /* further variable-length fields */
+#endif
+} FormData_pg_replication_identifier;
+
+/* ----------------
+ * Form_pg_extension corresponds to a pointer to a tuple with
+ * the format of pg_extension relation.
+ * ----------------
+ */
+typedef FormData_pg_replication_identifier *Form_pg_replication_identifier;
+
+/* ----------------
+ * compiler constants for pg_replication_identifier
+ * ----------------
+ */
+#define Natts_pg_replication_identifier 2
+#define Anum_pg_replication_riident 1
+#define Anum_pg_replication_riname 2
+
+/* ----------------
+ * pg_replication_identifier has no initial contents
+ * ----------------
+ */
+
+#endif /* PG_REPLICTION_IDENTIFIER_H */
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 5cfc0ae..c787523 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -265,6 +265,12 @@
#endif
/*
+ * Temporary switch to change between using xlog padding or a separate block
+ * id in the record to record the xlog origin of a record.
+ */
+/* #define REPLICATION_IDENTIFIER_REUSE_PADDING */
+
+/*
* Define this to cause palloc()'d memory to be filled with random data, to
* facilitate catching code that depends on the contents of uninitialized
* memory. Caution: this is horrendously expensive.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index cce4394..f78fb8f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -97,4 +97,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepNodeId origin_id);
+
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 0935c1b..26095b1 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -74,6 +74,13 @@ typedef void (*LogicalDecodeCommitCB) (
XLogRecPtr commit_lsn);
/*
+ * Filter changes by origin.
+ */
+typedef bool (*LogicalDecodeFilterByOriginCB) (
+ struct LogicalDecodingContext *,
+ RepNodeId origin_id);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (
@@ -89,6 +96,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f1e0f57..0c13fca 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -68,6 +68,8 @@ typedef struct ReorderBufferChange
/* The type of change. */
enum ReorderBufferChangeType action;
+ RepNodeId origin_id;
+
/*
* Context data for the change, which part of the union is valid depends
* on action/action_internal.
@@ -166,6 +168,10 @@ typedef struct ReorderBufferTXN
*/
XLogRecPtr restart_decoding_lsn;
+ /* origin of the change that caused this transaction */
+ RepNodeId origin_id;
+ XLogRecPtr origin_lsn;
+
/*
* Commit time, only known when we read the actual commit record.
*/
@@ -339,7 +345,7 @@ void ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
void ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time);
+ TimestampTz commit_time, RepNodeId origin_id, XLogRecPtr origin_lsn);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
diff --git a/src/include/replication/replication_identifier.h b/src/include/replication/replication_identifier.h
new file mode 100644
index 0000000..47cc032
--- /dev/null
+++ b/src/include/replication/replication_identifier.h
@@ -0,0 +1,62 @@
+/*-------------------------------------------------------------------------
+ * replication_identifier.h
+ * Exports from replication/logical/replication_identifier.c
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/replication_identifier.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef REPLICATION_IDENTIFIER_H
+#define REPLICATION_IDENTIFIER_H
+
+#include "access/xlogdefs.h"
+#include "catalog/pg_replication_identifier.h"
+#include "replication/logical.h"
+
+#define InvalidRepNodeId 0
+#define DoNotReplicateRepNodeId UINT16_MAX
+
+extern PGDLLIMPORT RepNodeId replication_origin_id;
+extern PGDLLIMPORT XLogRecPtr replication_origin_lsn;
+extern PGDLLIMPORT TimestampTz replication_origin_timestamp;
+
+/* API for querying & manipulating replication identifiers */
+extern RepNodeId GetReplicationIdentifier(char *name, bool missing_ok);
+extern RepNodeId CreateReplicationIdentifier(char *name);
+extern bool GetReplicationInfoByIdentifier(RepNodeId riident, bool missing_ok,
+ char **riname);
+extern void DropReplicationIdentifier(RepNodeId riident);
+
+/* API for querying & manipulating replication progress */
+extern void AdvanceReplicationIdentifier(RepNodeId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern XLogRecPtr ReplicationIdentifierProgress(RepNodeId node, bool flush);
+extern void AdvanceCachedReplicationIdentifier(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern void SetupCachedReplicationIdentifier(RepNodeId node);
+extern void TeardownCachedReplicationIdentifier(void);
+extern XLogRecPtr CachedReplicationIdentifierProgress(void);
+
+/* crash recovery support */
+extern void CheckPointReplicationIdentifier(void);
+extern void StartupReplicationIdentifier(void);
+
+/* internals */
+extern Size ReplicationIdentifierShmemSize(void);
+extern void ReplicationIdentifierShmemInit(void);
+
+/* SQL callable functions */
+extern Datum pg_replication_identifier_get(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_create(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_drop(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_setup_replaying_from(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_reset_replaying_from(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_is_replaying(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_setup_tx_origin(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_progress(PG_FUNCTION_ARGS);
+extern Datum pg_get_replication_identifier_progress(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_advance(PG_FUNCTION_ARGS);
+
+#endif
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e3c2efc..919708b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -134,8 +134,9 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
#define ReplicationSlotControlLock (&MainLWLockArray[37].lock)
#define CommitTsControlLock (&MainLWLockArray[38].lock)
#define CommitTsLock (&MainLWLockArray[39].lock)
+#define ReplicationIdentifierLock (&MainLWLockArray[40].lock)
-#define NUM_INDIVIDUAL_LWLOCKS 40
+#define NUM_INDIVIDUAL_LWLOCKS 41
/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index ba0b090..d7be45a 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,8 @@ enum SysCacheIdentifier
RANGETYPE,
RELNAMENSP,
RELOID,
+ REPLIDIDENT,
+ REPLIDREMOTE,
RULERELNAME,
STATRELATTINH,
TABLESPACEOID,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1788270..c6c6d3d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1390,6 +1390,11 @@ pg_prepared_xacts| SELECT p.transaction,
FROM ((pg_prepared_xact() p(transaction, gid, prepared, ownerid, dbid)
LEFT JOIN pg_authid u ON ((p.ownerid = u.oid)))
LEFT JOIN pg_database d ON ((p.dbid = d.oid)));
+pg_replication_identifier_progress| SELECT pg_get_replication_identifier_progress.local_id,
+ pg_get_replication_identifier_progress.external_id,
+ pg_get_replication_identifier_progress.remote_lsn,
+ pg_get_replication_identifier_progress.local_lsn
+ FROM pg_get_replication_identifier_progress() pg_get_replication_identifier_progress(local_id, external_id, remote_lsn, local_lsn);
pg_replication_slots| SELECT l.slot_name,
l.plugin,
l.slot_type,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index c7be273..400cba3 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -121,6 +121,7 @@ pg_pltemplate|t
pg_policy|t
pg_proc|t
pg_range|t
+pg_replication_identifier|t
pg_rewrite|t
pg_seclabel|t
pg_shdepend|t
--
2.3.0.149.gf3f4077
On 24/03/15 16:33, Andres Freund wrote:
Hi,
Here's the next version of this patch. I've tried to address the biggest
issue (documentation) and some more. Now that both the more flexible
commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it
looks much cleaner.
Nice, I see you also did the more close integration with CommitTs. I
only skimmed the patch but so far and it looks quite good, I'll take
closer look around end of the week.
I'd greatly appreciate some feedback on the documentation. I'm not
entirely sure into how much detail to go; and where exactly in the docs
to place it. I do wonder if we shouldn't merge this with the logical
decoding section and whether we could also document commit timestamps
somewhere in there.
Perhaps we should have some Logical replication developer documentation
section and put all those three as subsections of that?
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.
You might notice that Heikki is making the same point here that I've
attempted to make multiple times in the past: limiting to replication
identifier to 2 bytes because that's how much padding space you happen
to have available is optimizing for the wrong thing. What we should
be optimizing for is consistency and uniformity of design. System
catalogs have OIDs, so this one should, too. You're not going to be
able to paper over the fact that the column has some funky data type
that is unlike every other column in the system.
To the best of my knowledge, the statement that there is a noticeable
performance cost for those 2 extra bytes is also completely
unsupported by any actual benchmarking.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
So I did some more in depth look, I do have couple of comments.
I would really like to have something like "Logical Replication
Infrastructure" doc section that would have both decoding and
identifiers (and possibly even CommitTs) underneath.
There is typo in docs:
+ <para> + The optional <function>filter_by_origin_cb</function> callback + is called to determine wheter data that has been replayed
wheter -> whether
And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the
old 'foo' identifier. This can be problem because the identifier id is
used as origin of the data and the replication solution using the
replication identifiers can end up thinking that data came from node
'bar' even though they came from the node 'foo' which no longer exists.
This can have bad effects for example on conflict detection or debugging
problems with replication.
Maybe another reason to use standard Oids?
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote:
And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the old
'foo' identifier. This can be problem because the identifier id is used as
origin of the data and the replication solution using the replication
identifiers can end up thinking that data came from node 'bar' even though
they came from the node 'foo' which no longer exists. This can have bad
effects for example on conflict detection or debugging problems with
replication.Maybe another reason to use standard Oids?
As the same reason exists for oids, just somewhat less likely, I don't
see it as a reason for much. It's really not that hard to get oid
conflicts once your server has lived for a while. As soon as the oid
counter has wrapped around once, it's not unlikely to have
conflicts. And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point. The only material
difference this makes is that it's much easier to notice the problem
during development.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point.
Oh, and obviously toast data. A couple tables with toasted columns is
also a good way to rapidly consume oids.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-03-24 23:11:26 -0400, Robert Haas wrote:
On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.You might notice that Heikki is making the same point here that I've
attempted to make multiple times in the past: limiting to replication
identifier to 2 bytes because that's how much padding space you happen
to have available is optimizing for the wrong thing. What we should
be optimizing for is consistency and uniformity of design. System
catalogs have OIDs, so this one should, too. You're not going to be
able to paper over the fact that the column has some funky data type
that is unlike every other column in the system.To the best of my knowledge, the statement that there is a noticeable
performance cost for those 2 extra bytes is also completely
unsupported by any actual benchmarking.
I'm starting benchmarks now.
But I have to say: I find the idea that you'd need more than 2^16
identifiers anytime soon not very credible. The likelihood that
replication identifiers are the limiting factor towards that seems
incredibly small. Just consider how you'd apply changes from so many
remotes; how to stream changes to them; how to even configure such a
complex setup. We can easily change the size limits in the next major
release without anybody being inconvenienced.
We've gone through quite some lengths reducing the overhead of WAL. I
don't understand why it's important that we do not make compromises
here; but why that doesn't matter elsewhere.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4/7/15 9:30 AM, Andres Freund wrote:
On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote:
And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the old
'foo' identifier. This can be problem because the identifier id is used as
origin of the data and the replication solution using the replication
identifiers can end up thinking that data came from node 'bar' even though
they came from the node 'foo' which no longer exists. This can have bad
effects for example on conflict detection or debugging problems with
replication.Maybe another reason to use standard Oids?
As the same reason exists for oids, just somewhat less likely, I don't
see it as a reason for much. It's really not that hard to get oid
conflicts once your server has lived for a while. As soon as the oid
counter has wrapped around once, it's not unlikely to have
conflicts. And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point. The only material
difference this makes is that it's much easier to notice the problem
during development.
Why not just create a sequence? I suspect it may not be as fast to
assign as an OID, but it's not like you'd be doing this all the time.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Why not just create a sequence? I suspect it may not be as fast to assign as
an OID, but it's not like you'd be doing this all the time.
What does that have to do with the thread?
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4/7/15 10:58 AM, Andres Freund wrote:
Why not just create a sequence? I suspect it may not be as fast to assign as
an OID, but it's not like you'd be doing this all the time.What does that have to do with the thread?
The original bit was...
And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the old
'foo' identifier. This can be problem because the identifier id is used as
origin of the data and the replication solution using the replication
identifiers can end up thinking that data came from node 'bar' even though
they came from the node 'foo' which no longer exists. This can have bad
effects for example on conflict detection or debugging problems with
replication.Maybe another reason to use standard Oids?
Wasn't the reason for using OIDs so that we're not doing the equivalent
of max(identifier) + 1?
Perhaps I'm just confused...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point.Oh, and obviously toast data. A couple tables with toasted columns is
also a good way to rapidly consume oids.
You are forgetting as well large objects on the stack, when client
application does not assign an OID by itself.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/04/15 06:59, Michael Paquier wrote:
On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point.Oh, and obviously toast data. A couple tables with toasted columns is
also a good way to rapidly consume oids.You are forgetting as well large objects on the stack, when client
application does not assign an OID by itself.
And you guys are not getting my point. What I proposed was to not reuse
the RI id immediately because that can make debugging issues with
replication/conflict handling harder when something happens after
cluster configuration has changed. Whether it's done using Oid or some
other way, I don't really care and wrapping around eventually is ok,
since the old origin info for transactions will be cleared out during
the freeze at the latest anyway.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-08 14:17:04 +0200, Petr Jelinek wrote:
And you guys are not getting my point. What I proposed was to not reuse the
RI id immediately because that can make debugging issues with
replication/conflict handling harder when something happens after cluster
configuration has changed.
If that's the goal, you shouldn't delete the replication identifier at
that point. That's the only sane way preventing it from being reused.
Whether it's done using Oid or some other way, I don't really care and
wrapping around eventually is ok, since the old origin info for
transactions will be cleared out during the freeze at the latest
anyway.
How are you proposing to do the allocation then? There's no magic
preventing immediate reuse with oids or anything else. The oid counter
might *already* have wrapped around and point exactly to the identifier
you're about to delete. Then when you deleted it it's going to be reused
for the next allocated oid.
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/04/15 14:22, Andres Freund wrote:
On 2015-04-08 14:17:04 +0200, Petr Jelinek wrote:
And you guys are not getting my point. What I proposed was to not reuse the
RI id immediately because that can make debugging issues with
replication/conflict handling harder when something happens after cluster
configuration has changed.If that's the goal, you shouldn't delete the replication identifier at
that point. That's the only sane way preventing it from being reused.
Ok, I am happy with that solution.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
I'm starting benchmarks now.
What I'm benchmarking here is the WAL overhead, since that's what we're
debating.
The test setup I used was a pgbench scale 10 instance. I've run with
full_page_write=off to have more reproducible results. This of course
over-emphasizes the overhead a bit. But for a long checkpoint interval
and a memory resident working set it's not that unrealistic.
I ran 50k transactions in a signle b
baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672
extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%
padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%
extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%
To me that shows pretty clearly that a) reusing the padding is
worthwhile b) even without that using 2byte instead of 4 byte
identifiers is beneficial.
Now. Especially in the case of extern identifiers we *can* optimize a
bit more. But there's no way we can get the efficiency of the version
reusing padding.
To run the benchmarks you need to
SELECT pg_replication_identifier_create('frak');
before starting pgbench with the attached file.
Greetings,
Andres Freund
Attachments:
On 10/04/15 18:03, Andres Freund wrote:
On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
I'm starting benchmarks now.
What I'm benchmarking here is the WAL overhead, since that's what we're
debating.The test setup I used was a pgbench scale 10 instance. I've run with
full_page_write=off to have more reproducible results. This of course
over-emphasizes the overhead a bit. But for a long checkpoint interval
and a memory resident working set it's not that unrealistic.I ran 50k transactions in a signle b
baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672
extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%
padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%
extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%To me that shows pretty clearly that a) reusing the padding is
worthwhile b) even without that using 2byte instead of 4 byte
identifiers is beneficial.
My opinion is that 10% of WAL size difference is quite high price to pay
so that we can keep the padding for some other, yet unknown feature that
hasn't come up in several years, which would need those 2 bytes.
But if we are willing to pay it then we can really go all the way and
just use Oids...
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 04/12/2015 02:56 AM, Petr Jelinek wrote:
On 10/04/15 18:03, Andres Freund wrote:
On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
I'm starting benchmarks now.
What I'm benchmarking here is the WAL overhead, since that's what we're
debating.The test setup I used was a pgbench scale 10 instance. I've run with
full_page_write=off to have more reproducible results. This of course
over-emphasizes the overhead a bit. But for a long checkpoint interval
and a memory resident working set it's not that unrealistic.I ran 50k transactions in a signle b
baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672
extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%
padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%
extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%To me that shows pretty clearly that a) reusing the padding is
worthwhile b) even without that using 2byte instead of 4 byte
identifiers is beneficial.My opinion is that 10% of WAL size difference is quite high price to pay
so that we can keep the padding for some other, yet unknown feature that
hasn't come up in several years, which would need those 2 bytes.But if we are willing to pay it then we can really go all the way and
just use Oids...
This needs to be weighed against removing the padding bytes altogether.
See attached. That would reduce the WAL size further when you don't need
replication IDs. It's very straightforward, but need to do some
performance/scalability testing to make sure that using memcpy instead
of a straight 32-bit assignment doesn't hurt performance, since it
happens in very performance critical paths.
I'm surprised there's such a big difference between the "extern" and
"padding" versions above. At a quick approximation, storing the ID as a
separate "fragment", along with XLogRecordDataHeaderShort and
XLogRecordDataHeaderLong, should add one byte of overhead plus the ID
itself. So that would be 3 extra bytes for 2-byte identifiers, or 5
bytes for 4-byte identifiers. Does that mean that the average record
length is only about 30 bytes? That's what it seems like, if adding the
"extern 2 byte identifiers" added about 10% of overhead compared to the
"padding 2 byte identifiers" version. That doesn't sound right, 30 bytes
is very little. Perhaps the size of the records created by pgbench
happen to cross a 8-byte alignment boundary at that point, making a big
difference. In another workload, there might be no difference at all,
due to alignment.
Also, you don't need to tag every record type with the replication ID.
All indexam records can skip it, for starters, since logical decoding
doesn't care about them. That should remove a fair amount of bloat.
- Heikki
Attachments:
remove-xlogrecord-padding-1.patchapplication/x-patch; name=remove-xlogrecord-padding-1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24cf520..09934f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -963,10 +963,10 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
*/
- rdata_crc = rechdr->xl_crc;
+ memcpy(&rdata_crc, rechdr->xl_crc, sizeof(pg_crc32));
COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(rdata_crc);
- rechdr->xl_crc = rdata_crc;
+ memcpy(rechdr->xl_crc, &rdata_crc, sizeof(pg_crc32));
/*
* All the record data, including the header, is now ready to be
@@ -4685,7 +4685,7 @@ BootStrapXLOG(void)
COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(crc);
- record->xl_crc = crc;
+ memcpy(record->xl_crc, &crc, sizeof(pg_crc32));
/* Create first XLOG segment file */
use_existent = false;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 88209c3..fbe97b1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -724,7 +724,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
rechdr->xl_prev = InvalidXLogRecPtr;
- rechdr->xl_crc = rdata_crc;
+ memcpy(rechdr->xl_crc, &rdata_crc, sizeof(pg_crc32));
return &hdr_rdt;
}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a4124d9..80a48ba 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -666,6 +666,9 @@ static bool
ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
{
pg_crc32 crc;
+ pg_crc32 rec_crc;
+
+ memcpy(&rec_crc, record->xl_crc, sizeof(pg_crc32));
/* Calculate the CRC */
INIT_CRC32C(crc);
@@ -674,7 +677,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(crc);
- if (!EQ_CRC32C(record->xl_crc, crc))
+ if (!EQ_CRC32C(rec_crc, crc))
{
report_invalid_record(state,
"incorrect resource manager data checksum in record at %X/%X",
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 3361111..6af27f2 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -1101,7 +1101,7 @@ WriteEmptyXLOG(void)
COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(crc);
- record->xl_crc = crc;
+ memcpy(record->xl_crc, &crc, sizeof(pg_crc32));
/* Write the first page */
XLogFilePath(path, ControlFile.checkPointCopy.ThisTimeLineID, newXlogSegNo);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 09bbcb1..e26f354 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -45,14 +45,14 @@ typedef struct XLogRecord
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
- /* 2 bytes of padding here, initialize to zero */
- pg_crc32 xl_crc; /* CRC for this record */
+ uint8 xl_crc[4]; /* CRC for this record. (as a byte array rather
+ * than pg_crc32 to avoid padding) */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
-#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32))
+#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + 4 * sizeof(uint8))
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
On 2015-04-12 22:02:38 +0300, Heikki Linnakangas wrote:
This needs to be weighed against removing the padding bytes
altogether.
Hrmpf. Says the person that used a lot of padding, without much
discussion, for the WAL level infrastructure making pg_rewind more
maintainable. And you deemed to be perfectly ok to use them up to avoid
*increasing* the WAL size with the *additional data* (which so far
nothing but pg_rewind needs in that way). While it perfectly well could
have been used to shrink the WAL size to less than it now is. And that's
*far*, *far* harder to back out/refactor changes than this (which are
pretty localized and thus can easily be changed); to the point that I
think it's infeasible to do so...
If you want to shrink the WAL size, send in a patch independently. Not
as a way to block somebody else implementing something.
I'm surprised there's such a big difference between the "extern" and
"padding" versions above. At a quick approximation, storing the ID as a
separate "fragment", along with XLogRecordDataHeaderShort and
XLogRecordDataHeaderLong, should add one byte of overhead plus the ID
itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 bytes
for 4-byte identifiers. Does that mean that the average record length is
only about 30 bytes?
Yes, nearly. That's xlogdump --stats=record from the above scenario with
replication identifiers used and reusing the padding:
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Transaction/COMMIT 50003 ( 16.89) 2600496 ( 23.38) 0 ( -nan) 2600496 ( 23.38)
CLOG/ZEROPAGE 1 ( 0.00) 28 ( 0.00) 0 ( -nan) 28 ( 0.00)
Standby/RUNNING_XACTS 5 ( 0.00) 248 ( 0.00) 0 ( -nan) 248 ( 0.00)
Heap2/CLEAN 46034 ( 15.55) 1473088 ( 13.24) 0 ( -nan) 1473088 ( 13.24)
Heap2/VISIBLE 2 ( 0.00) 56 ( 0.00) 0 ( -nan) 56 ( 0.00)
Heap/INSERT 49682 ( 16.78) 1341414 ( 12.06) 0 ( -nan) 1341414 ( 12.06)
Heap/HOT_UPDATE 150013 ( 50.67) 5700494 ( 51.24) 0 ( -nan) 5700494 ( 51.24)
Heap/INPLACE 5 ( 0.00) 130 ( 0.00) 0 ( -nan) 130 ( 0.00)
Heap/INSERT+INIT 318 ( 0.11) 8586 ( 0.08) 0 ( -nan) 8586 ( 0.08)
Btree/VACUUM 2 ( 0.00) 56 ( 0.00) 0 ( -nan) 56 ( 0.00)
-------- -------- -------- --------
Total 296065 11124596 [100.00%] 0 [0.00%] 11124596 [100%
(The FPI percentage display above is arguably borked. Interesting.)
So the average record size is ~37.5 bytes including the increased commit
record size due to the origin information (which is the part that
increases the size for that version that reuses the padding).
This *most definitely* isn't representative of every workload. But it
*is* *a* common type of workload.
Note that --stats will *not* show the size difference in xlog records
when adding data as an additional chunk vs. padding as it uses
XLogRecGetDataLen() to compute the record length... That confused me for
a while.
That doesn't sound right, 30 bytes is very little.
Well, it's mostly HOT_UPDATES and INSERTS into not indexed tables. So
that's not too surprising. Obviously that'd look different with FPIs
enabled.
Perhaps the size
of the records created by pgbench happen to cross a 8-byte alignment
boundary at that point, making a big difference. In another workload,
there might be no difference at all, due to alignment.
Right.
Also, you don't need to tag every record type with the replication ID. All
indexam records can skip it, for starters, since logical decoding doesn't
care about them. That should remove a fair amount of bloat.
Yes. I mentioned that. It's additional complexity because now the
decision has to be made at each xlog insertion callsite. Making
refactoring this into a different representation a bit harder. I don't
think it will make that much of a differenced in the above workload
(just CLEAN will be smaller); but it clearly might in others.
I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for "external" 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward. I
still don't see a point in using 4 byte identifiers atm, given the above
numbers that just seems like a waste for unrealistic use cases (>2^16
nodes). It's just two lines to change if we feel the need in the future.
Working on fixing the issue with WAL logging of deletions and
rearranging docs as Petr suggested. Not sure if the latter will really
look good, but I guess we'll see ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Introduce-replication-identifiers-v1.1.patchtext/x-patch; charset=us-asciiDownload
>From 841733fff1394eaafb25272e12cee92d4c94906c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 9 Apr 2015 15:01:00 +0200
Subject: [PATCH] Introduce replication identifiers: v1.1
Replication identifiers are used to identify nodes in a replication
setup, identify changes that are created due to replication and to keep
track of replication progress.
Primarily this is useful because solving these in other ways is
possible, but ends up being much less efficient and more complicated. We
don't want to require replication solutions to reimplement logic for
this independently. The infrastructure is intended to be generic enough
to be reusable.
This infrastructure replaces the 'nodeid' infrastructure of commit
timestamps. Except that there's only 2^16 identifiers, the
infrastructure provided here integrates with logical replication and is
available via SQL. Since the commit timestamp infrastructure has also
been introduced in 9.5 that's not a problem.
For now the number of nodes whose replication progress can be tracked is
determined by the max_replication_slots GUC. It's not perfect to reuse
that GUC, but there doesn't seem to be sufficient reason to introduce a
separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Robert Haas, Heikki Linnakangas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
---
contrib/test_decoding/Makefile | 3 +-
contrib/test_decoding/expected/replident.out | 127 ++
contrib/test_decoding/sql/replident.sql | 58 +
contrib/test_decoding/test_decoding.c | 28 +
doc/src/sgml/catalogs.sgml | 124 ++
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/func.sgml | 162 ++-
doc/src/sgml/logicaldecoding.sgml | 35 +-
doc/src/sgml/postgres.sgml | 1 +
doc/src/sgml/replication-identifiers.sgml | 89 ++
src/backend/access/heap/heapam.c | 19 +
src/backend/access/rmgrdesc/xactdesc.c | 24 +-
src/backend/access/transam/commit_ts.c | 53 +-
src/backend/access/transam/xact.c | 72 +-
src/backend/access/transam/xlog.c | 8 +
src/backend/access/transam/xloginsert.c | 32 +-
src/backend/access/transam/xlogreader.c | 6 +
src/backend/catalog/Makefile | 2 +-
src/backend/catalog/catalog.c | 8 +-
src/backend/catalog/system_views.sql | 7 +
src/backend/replication/logical/Makefile | 3 +-
src/backend/replication/logical/decode.c | 48 +-
src/backend/replication/logical/logical.c | 33 +
src/backend/replication/logical/reorderbuffer.c | 5 +-
.../replication/logical/replication_identifier.c | 1296 ++++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/cache/syscache.c | 23 +
src/bin/pg_resetxlog/pg_resetxlog.c | 3 +
src/include/access/commit_ts.h | 14 +-
src/include/access/xact.h | 11 +
src/include/access/xlog.h | 1 +
src/include/access/xlog_internal.h | 2 +-
src/include/access/xlogdefs.h | 6 +
src/include/access/xloginsert.h | 1 +
src/include/access/xlogreader.h | 3 +
src/include/access/xlogrecord.h | 3 +
src/include/catalog/catversion.h | 2 +-
src/include/catalog/indexing.h | 6 +
src/include/catalog/pg_proc.h | 30 +
src/include/catalog/pg_replication_identifier.h | 74 ++
src/include/replication/logical.h | 2 +
src/include/replication/output_plugin.h | 8 +
src/include/replication/reorderbuffer.h | 8 +-
src/include/replication/replication_identifier.h | 62 +
src/include/storage/lwlock.h | 3 +-
src/include/utils/syscache.h | 2 +
src/test/regress/expected/rules.out | 5 +
src/test/regress/expected/sanity_check.out | 1 +
48 files changed, 2432 insertions(+), 85 deletions(-)
create mode 100644 contrib/test_decoding/expected/replident.out
create mode 100644 contrib/test_decoding/sql/replident.sql
create mode 100644 doc/src/sgml/replication-identifiers.sgml
create mode 100644 src/backend/replication/logical/replication_identifier.c
create mode 100644 src/include/catalog/pg_replication_identifier.h
create mode 100644 src/include/replication/replication_identifier.h
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 438be44..f8334cc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -37,7 +37,8 @@ submake-isolation:
submake-test_decoding:
$(MAKE) -C $(top_builddir)/contrib/test_decoding
-REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel binary prepared
+REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \
+ binary prepared replident
regresscheck: all | submake-regress submake-test_decoding
$(MKDIR_P) regression_output
diff --git a/contrib/test_decoding/expected/replident.out b/contrib/test_decoding/expected/replident.out
new file mode 100644
index 0000000..f6dc404
--- /dev/null
+++ b/contrib/test_decoding/expected/replident.out
@@ -0,0 +1,127 @@
+-- predictability
+SET synchronous_commit = on;
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+ pg_replication_identifier_create
+----------------------------------
+ 1
+(1 row)
+
+-- ensure duplicate creations fail
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+ERROR: duplicate key value violates unique constraint "pg_replication_identifier_riname_index"
+DETAIL: Key (riname)=(test_decoding: regression_slot) already exists.
+--ensure deletions work (once)
+SELECT pg_replication_identifier_create('test_decoding: temp');
+ pg_replication_identifier_create
+----------------------------------
+ 2
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+ pg_replication_identifier_drop
+--------------------------------
+
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+ERROR: cache lookup failed for replication identifier named test_decoding: temp
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ BEGIN
+ table public.target_tbl: INSERT: id[integer]:1 data[text]:'BEGIN'
+ table public.target_tbl: INSERT: id[integer]:2 data[text]:'table public.origin_tbl: INSERT: id[integer]:1 data[text]:''will be replicated and decoded and decoded again'''
+ table public.target_tbl: INSERT: id[integer]:3 data[text]:'COMMIT'
+ COMMIT
+(5 rows)
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+-- mark session as replaying
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+ pg_replication_identifier_setup_replaying_from
+------------------------------------------------
+
+(1 row)
+
+-- ensure we prevent duplicate setup
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+ERROR: cannot setup replication origin when one is already setup
+BEGIN;
+-- setup transaction origins
+SELECT pg_replication_identifier_setup_tx_origin('0/ffffffff', '2013-01-01 00:00');
+ pg_replication_identifier_setup_tx_origin
+-------------------------------------------
+
+(1 row)
+
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+SELECT pg_replication_identifier_reset_replaying_from();
+ pg_replication_identifier_reset_replaying_from
+------------------------------------------------
+
+(1 row)
+
+SELECT local_id, external_id, remote_lsn, local_lsn <> '0/0' FROM pg_replication_identifier_progress;
+ local_id | external_id | remote_lsn | ?column?
+----------+--------------------------------+------------+----------
+ 1 | test_decoding: regression_slot | 0/FFFFFFFF | t
+(1 row)
+
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', false);
+ pg_replication_identifier_progress
+------------------------------------
+ 0/FFFFFFFF
+(1 row)
+
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', true);
+ pg_replication_identifier_progress
+------------------------------------
+ 0/FFFFFFFF
+(1 row)
+
+-- ensure reset requires previously setup state
+SELECT pg_replication_identifier_reset_replaying_from();
+ERROR: no replication identifier is set up
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+------
+(0 rows)
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+--------------------------------------------------------------------------------
+ BEGIN
+ table public.origin_tbl: INSERT: id[integer]:3 data[text]:'will be replicated'
+ COMMIT
+(3 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_replication_identifier_drop('test_decoding: regression_slot');
+ pg_replication_identifier_drop
+--------------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/replident.sql b/contrib/test_decoding/sql/replident.sql
new file mode 100644
index 0000000..d5ba486
--- /dev/null
+++ b/contrib/test_decoding/sql/replident.sql
@@ -0,0 +1,58 @@
+-- predictability
+SET synchronous_commit = on;
+
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+-- ensure duplicate creations fail
+SELECT pg_replication_identifier_create('test_decoding: regression_slot');
+
+--ensure deletions work (once)
+SELECT pg_replication_identifier_create('test_decoding: temp');
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+SELECT pg_replication_identifier_drop('test_decoding: temp');
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+
+-- mark session as replaying
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+
+-- ensure we prevent duplicate setup
+SELECT pg_replication_identifier_setup_replaying_from('test_decoding: regression_slot');
+
+BEGIN;
+-- setup transaction origins
+SELECT pg_replication_identifier_setup_tx_origin('0/ffffffff', '2013-01-01 00:00');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+
+SELECT pg_replication_identifier_reset_replaying_from();
+
+SELECT local_id, external_id, remote_lsn, local_lsn <> '0/0' FROM pg_replication_identifier_progress;
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', false);
+SELECT pg_replication_identifier_progress('test_decoding: regression_slot', true);
+
+-- ensure reset requires previously setup state
+SELECT pg_replication_identifier_reset_replaying_from();
+
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_replication_identifier_drop('test_decoding: regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 963d5df..2ec3001 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -21,6 +21,7 @@
#include "replication/output_plugin.h"
#include "replication/logical.h"
+#include "replication/replication_identifier.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -43,6 +44,7 @@ typedef struct
bool include_timestamp;
bool skip_empty_xacts;
bool xact_wrote_changes;
+ bool only_local;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +61,8 @@ static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
+static bool pg_decode_filter(LogicalDecodingContext *ctx,
+ RepNodeId origin_id);
void
_PG_init(void)
@@ -76,6 +80,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
}
@@ -97,6 +102,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_xids = true;
data->include_timestamp = false;
data->skip_empty_xacts = false;
+ data->only_local = false;
ctx->output_plugin_private = data;
@@ -155,6 +161,17 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "only-local") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->only_local = true;
+ else if (!parse_bool(strVal(elem->arg), &data->only_local))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -223,6 +240,17 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+static bool
+pg_decode_filter(LogicalDecodingContext *ctx,
+ RepNodeId origin_id)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->only_local && origin_id != InvalidRepNodeId)
+ return true;
+ return false;
+}
+
/*
* Print literal `outputstr' already represented as string of type `typid'
* into stringbuf `s'.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d0b78f2..f5ee567 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -239,6 +239,16 @@
</row>
<row>
+ <entry><link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link></entry>
+ <entry>registered replication identifiers</entry>
+ </row>
+
+ <row>
+ <entry><link linkend="catalog-pg-replication-identifier-progress"><structname>pg_replication_identifier_progress</structname></link></entry>
+ <entry>information about logical replication progress</entry>
+ </row>
+
+ <row>
<entry><link linkend="catalog-pg-replication-slots"><structname>pg_replication_slots</structname></link></entry>
<entry>replication slot information</entry>
</row>
@@ -5323,6 +5333,120 @@
</sect1>
+ <sect1 id="catalog-pg-replication-identifier">
+ <title><structname>pg_replication_identifier</structname></title>
+
+ <indexterm zone="catalog-pg-replication-identifier">
+ <primary>pg_replication_identifier</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_replication_identifier</structname> catalog
+ contains all replication identifiers created. For more on
+ replication identifiers
+ see <xref linkend="replication-identifiers">.
+ </para>
+
+ <table>
+
+ <title><structname>pg_replication_identifier</structname> Columns</title>
+
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Type</entry>
+ <entry>References</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>riident</structfield></entry>
+ <entry><type>Oid</type></entry>
+ <entry></entry>
+ <entry>A unique, cluster-wide identifier for the replication
+ identifier. Should never leave the system.</entry>
+ </row>
+
+ <row>
+ <entry><structfield>riname</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry></entry>
+ <entry>The external, user defined, name of a replication
+ identifier.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="catalog-pg-replication-identifier-progress">
+ <title><structname>pg_replication_identifier_progress</structname></title>
+
+ <indexterm zone="catalog-pg-replication-identifier-progress">
+ <primary>pg_replication_identifier_progress</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_replication_identifier_progress</structname>
+ view contains information about how far replication for a certain
+ replication identifier has progressed. For more on replication
+ identifiers see <xref linkend="replication-identifiers">.
+ </para>
+
+ <table>
+
+ <title><structname>pg_replication_identifier_progress</structname> Columns</title>
+
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Type</entry>
+ <entry>References</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>local_id</structfield></entry>
+ <entry><type>Oid</type></entry>
+ <entry><literal><link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link>.riident</literal></entry>
+ <entry>internal node identifier</entry>
+ </row>
+
+ <row>
+ <entry><structfield>external_id</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry><literal><link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link>.riname</literal></entry>
+ <entry>external node identifier</entry>
+ </row>
+
+ <row>
+ <entry><structfield>remote_lsn</structfield></entry>
+ <entry><type>pg_lsn</type></entry>
+ <entry></entry>
+ <entry>The origin node's LSN up to which data has been replicated.</entry>
+ </row>
+
+
+ <row>
+ <entry><structfield>local_lsn</structfield></entry>
+ <entry><type>pg_lsn</type></entry>
+ <entry></entry>
+ <entry>This node's LSN that at
+ which <literal>remote_lsn</literal> has been replicated. Used to
+ flush commit records before persisting data to disk when using
+ asynchronous commits.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
<sect1 id="catalog-pg-replication-slots">
<title><structname>pg_replication_slots</structname></title>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 2d7514c..00cc456 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -95,6 +95,7 @@
<!ENTITY fdwhandler SYSTEM "fdwhandler.sgml">
<!ENTITY custom-scan SYSTEM "custom-scan.sgml">
<!ENTITY logicaldecoding SYSTEM "logicaldecoding.sgml">
+<!ENTITY replication-identifiers SYSTEM "replication-identifiers.sgml">
<!ENTITY protocol SYSTEM "protocol.sgml">
<!ENTITY sources SYSTEM "sources.sgml">
<!ENTITY storage SYSTEM "storage.sgml">
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5f7bf6a..8cce9a3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16876,9 +16876,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<para>
The functions shown in <xref linkend="functions-replication-table"> are
for controlling and interacting with replication features.
- See <xref linkend="streaming-replication">
- and <xref linkend="streaming-replication-slots"> for information about the
- underlying features. Use of these functions is restricted to superusers.
+ See <xref linkend="streaming-replication">,
+ <xref linkend="streaming-replication-slots">, <xref linkend="replication-identifiers">
+ for information about the underlying features. Use of these
+ functions is restricted to superusers.
</para>
<para>
@@ -17035,6 +17036,161 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
on future calls.
</entry>
</row>
+
+ <row id="replication-identifier-create">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_create</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_create(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ <parameter>internal_id</parameter> <type>oid</type>
+ </entry>
+ <entry>
+ Create a replication identifier based on the passed in
+ external name, and create an internal id for it.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_get</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_get(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ <parameter>internal_id</parameter> <type>oid</type>
+ </entry>
+ <entry>
+ Lookup replication identifier and return the internal id. If
+ no replication identifier is found a error is thrown.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-setup-replaying-from">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_setup_replaying_from</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_setup_replaying_from(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Signal that the current session is replaying from the passed
+ in node. This will mark changes and transactions emitted by
+ session to be marked as originating from that node. Normal
+ operation can be resumed using
+ <function>pg_replication_identifier_reset_replaying_from</function>. Can
+ only be used if no previous origin is configured.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_reset_replaying_from</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_reset_replaying_from(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Teardown configured replication identifier setup by
+ <function>pg_replication_identifier_setup_replaying_from</function>.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-setup-tx-origin">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_setup_tx_origin</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_setup_tx_origin(<parameter>origin_lsn</parameter> <type>pg_lsn</type>, <parameter>origin_timestamp</parameter> <type>timestamptz</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Mark the current transaction to be replication a transaction
+ that has committed at the passed in <acronym>LSN</acronym> and
+ timestamp. Can only be called when a replication origin has
+ previously been configured using
+ <function>pg_replication_identifier_setup_replaying_from</function>.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_is_replaying</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_is_replaying()</function></literal>
+ </entry>
+ <entry>
+ bool
+ </entry>
+ <entry>
+ Has a replication identifer been setup in the current session?
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_advance</primary>
+ </indexterm>
+ <literal>pg_replication_identifier_advance<function>(<parameter>node_name</parameter> <type>text</type>, <parameter>pos</parameter> <type>pg_lsn</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Set replication progress for the passed in node to the passed
+ in position. This primarily is useful for setting up the
+ initial position or a new position after configuration changes
+ and similar. Be aware that careless use of this function can
+ lead to inconsistently replicated data.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-drop">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_drop</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_drop(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Delete a previously created replication identifier.
+ </entry>
+ </row>
+
+ <row id="replication-identifier-progress">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_identifier_progress</primary>
+ </indexterm>
+ <literal><function>pg_replication_identifier_progress(<parameter>node_name</parameter> <type>text</type>, <parameter>flush</parameter> <type>bool</type>)</function></literal>
+ </entry>
+ <entry>
+ pg_lsn
+ </entry>
+ <entry>
+ Return the replay position for the passed in replication
+ identifier. The parameter <parameter>flush</parameter>
+ determines whether the corresponding local transaction will be
+ guaranteed to have been flushed to disk or not.
+ </entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3650567..c84a1769 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -363,6 +363,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -370,7 +371,8 @@ typedef void (*LogicalOutputPluginInit)(struct OutputPluginCallbacks *cb);
</programlisting>
The <function>begin_cb</function>, <function>change_cb</function>
and <function>commit_cb</function> callbacks are required,
- while <function>startup_cb</function>
+ while <function>startup_cb</function>,
+ <function>filter_by_origin_cb</function>
and <function>shutdown_cb</function> are optional.
</para>
</sect2>
@@ -569,6 +571,37 @@ typedef void (*LogicalDecodeChangeCB) (
</para>
</note>
</sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-by-origin">
+ <title>Origin Filter Callback</title>
+
+ <para>
+ The optional <function>filter_by_origin_cb</function> callback
+ is called to determine wheter data that has been replayed
+ from <parameter>origin_id</parameter> is of interest to the
+ output plugin.
+<programlisting>
+typedef bool (*LogicalDecodeChangeCB) (
+ struct LogicalDecodingContext *ctx,
+ RepNodeId origin_id
+);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. No information but the origin is
+ available. To signal that changes originating on the passed in
+ node are irrelevant, return true, causing them to be filtered
+ away; false otherwise. The other callbacks will not be called
+ for transactions and changes that have been filtered away.
+ </para>
+ <para>
+ This is useful when implementing cascading or multi directional
+ replication solutions. Filtering by the origin allows to
+ prevent replicating the same changes back and forth in such
+ setups. While transactions and changes also carry information
+ about the origin, filtering via this callback is noticeably
+ more efficient.
+ </para>
+ </sect3>
</sect2>
<sect2 id="logicaldecoding-output-plugin-output">
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index e378d69..5e2eacb 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -220,6 +220,7 @@
&spi;
&bgworker;
&logicaldecoding;
+ &replication-identifiers;
</part>
diff --git a/doc/src/sgml/replication-identifiers.sgml b/doc/src/sgml/replication-identifiers.sgml
new file mode 100644
index 0000000..707a4e5
--- /dev/null
+++ b/doc/src/sgml/replication-identifiers.sgml
@@ -0,0 +1,89 @@
+<!-- doc/src/sgml/replication-identifiers.sgml -->
+<chapter id="replication-identifiers">
+ <title>Replication Identifiers</title>
+ <indexterm zone="replication-identifiers">
+ <primary>Replication Identifiers</primary>
+ </indexterm>
+
+ <para>
+ Replication identifiers are intended to make it easier to implement
+ logical replication solutions on top
+ of <xref linkend="logicaldecoding">. They provide a solution to two
+ common problems:
+ <itemizedlist>
+ <listitem><para>How to safely keep track of replication progress</para></listitem>
+ <listitem><para>How to change replication behavior, based on the
+ origin of a row; e.g. to avoid loops in bi-directional replication
+ setups</para></listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ Replication identifiers consist out of a external name, and a
+ internal identifier. The external identifier is free-form. It should
+ be used in a way that makes conflicts between replication
+ identifiers created by different replication solutions unlikely;
+ e.g. by prefixing the replication solution's name. The internal
+ identifier is used only to avoid having to store the long version in
+ situations where space efficiency is important. It should never be
+ shared between systems.
+ </para>
+
+ <para>
+ Replication identifiers can be created using the
+ <link linkend="replication-identifier-create"><function>pg_replication_identifier_create()</function></link>;
+ dropped using
+ <link linkend="replication-identifier-drop"><function>pg_replication_identifier_drop()</function></link>;
+ and seen in the
+ <link linkend="catalog-pg-replication-identifier"><structname>pg_replication_identifier</structname></link>
+ catalog.
+ </para>
+
+ <para>
+ When replicating from one system to another (independent of the fact
+ that those two might be in the same cluster, or even same database)
+ one nontrivial part of building a replication solution is to keep
+ track of replication progress. When the applying process or the
+ whole cluster dies, it needs to be able to find out up to where data
+ has successfully been replicated. Naive solutions to this like
+ updating a row in a table for every replayed transaction have
+ problems like bloat.
+ </para>
+
+ <para>
+ Using the replication identifier infrastructure a session can be
+ marked as replaying from a remote node (using the
+ <link linkend="replication-identifier-setup-replaying-from"><function>pg_replication_identifier_setup_replaying_from()</function></link>
+ function. Additionally the <acronym>LSN</acronym> and commit
+ timestamp of every source transaction can be configured on a per
+ transaction basis using
+ <link linkend="replication-identifier-setup-tx-origin"><function>pg_replication_identifier_setup_tx_origin()</function></link>.
+ If that's done replication progress will be persist in a crash safe
+ manner. Replication progress for all replication identifiers can be
+ seen in the
+ <link linkend="catalog-pg-replication-identifier-progress">
+ <structname>pg_replication_progress</structname>
+ </link> view. A individual identifier's progress, e.g. when resuming
+ replication, can be acquired using
+ <link linkend="replication-identifier-progress"><function>pg_replication_identifier_progress()</function></link>
+ </para>
+
+ <para>
+ In more complex replication topologies than replication from exactly
+ one system to one other another problem can be that it's hard to
+ avoid replicating replicated rows again. That can lead both to
+ cycles in the replication and inefficiencies. Replication
+ identifiers provide a, optional, mechanism to recognize and prevent
+ that. When setup using the functions referenced in the previous
+ paragraph every change and transaction passed to output plugin
+ callbacks (see <xref linkend="logicaldecoding-output-plugin">)
+ generated by the session is tagged with the replication identifier
+ of the generating session. This allows to treat them differently in
+ the output plugin, e.g. ignoring all but locally originating rows.
+ Additionally the <link linkend="logicaldecoding-output-plugin-filter-by-origin">
+ <function>filter_by_origin_cb</function></link> callback can be used
+ to filter the logical decoding change stream based on the
+ source. While less flexible, filtering via that callback is
+ considerably more efficient.
+ </para>
+</chapter>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 457cd70..b504ccd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2189,6 +2189,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
(char *) heaptup->t_data + SizeofHeapTupleHeader,
heaptup->t_len - SizeofHeapTupleHeader);
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
@@ -2499,6 +2502,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
XLogRegisterBufData(0, tupledata, totaldatalen);
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP2_ID, info);
PageSetLSN(page, recptr);
@@ -2920,6 +2927,9 @@ l1:
- SizeofHeapTupleHeader);
}
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
PageSetLSN(page, recptr);
@@ -4650,6 +4660,8 @@ failed:
tuple->t_data->t_infomask2);
XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ /* we don't decode row locks atm, so no need to log the origin */
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
PageSetLSN(page, recptr);
@@ -5429,6 +5441,8 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
+ /* inplace updates aren't decoded atm, don't log the origin */
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
PageSetLSN(page, recptr);
@@ -6787,6 +6801,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
old_key_tuple->t_len - SizeofHeapTupleHeader);
}
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP_ID, info);
return recptr;
@@ -6860,6 +6877,8 @@ log_heap_new_cid(Relation relation, HeapTuple tup)
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapNewCid);
+ /* will be looked at irrespective of origin */
+
recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_NEW_CID);
return recptr;
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index b036b6d..4df0bce 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -101,6 +101,16 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
data += sizeof(xl_xact_twophase);
}
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin *xl_origin = (xl_xact_origin *) data;
+
+ parsed->origin_lsn = xl_origin->origin_lsn;
+ parsed->origin_timestamp = xl_origin->origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
+ }
}
void
@@ -156,7 +166,7 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
}
static void
-xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec)
+xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepNodeId origin_id)
{
xl_xact_parsed_commit parsed;
int i;
@@ -218,6 +228,15 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec)
if (XactCompletionForceSyncCommit(parsed.xinfo))
appendStringInfo(buf, "; sync");
+
+ if (parsed.xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ appendStringInfo(buf, "; origin: node %u, lsn %X/%X, at %s",
+ origin_id,
+ (uint32)(parsed.origin_lsn >> 32),
+ (uint32)parsed.origin_lsn,
+ timestamptz_to_str(parsed.origin_timestamp));
+ }
}
static void
@@ -274,7 +293,8 @@ xact_desc(StringInfo buf, XLogReaderState *record)
{
xl_xact_commit *xlrec = (xl_xact_commit *) rec;
- xact_desc_commit(buf, XLogRecGetInfo(record), xlrec);
+ xact_desc_commit(buf, XLogRecGetInfo(record), xlrec,
+ XLogRecGetOrigin(record));
}
else if (info == XLOG_XACT_ABORT || info == XLOG_XACT_ABORT_PREPARED)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index dc23ab2..ffc3466 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -49,18 +49,18 @@
*/
/*
- * We need 8+4 bytes per xact. Note that enlarging this struct might mean
+ * We need 8+2 bytes per xact. Note that enlarging this struct might mean
* the largest possible file name is more than 5 chars long; see
* SlruScanDirectory.
*/
typedef struct CommitTimestampEntry
{
TimestampTz time;
- CommitTsNodeId nodeid;
+ RepNodeId nodeid;
} CommitTimestampEntry;
#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
- sizeof(CommitTsNodeId))
+ sizeof(RepNodeId))
#define COMMIT_TS_XACTS_PER_PAGE \
(BLCKSZ / SizeOfCommitTimestampEntry)
@@ -93,43 +93,18 @@ CommitTimestampShared *commitTsShared;
/* GUC variable */
bool track_commit_timestamp;
-static CommitTsNodeId default_node_id = InvalidCommitTsNodeId;
-
static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz ts,
- CommitTsNodeId nodeid, int pageno);
+ RepNodeId nodeid, int pageno);
static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
- CommitTsNodeId nodeid, int slotno);
+ RepNodeId nodeid, int slotno);
static int ZeroCommitTsPage(int pageno, bool writeXlog);
static bool CommitTsPagePrecedes(int page1, int page2);
static void WriteZeroPageXlogRec(int pageno);
static void WriteTruncateXlogRec(int pageno);
static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid);
-
-
-/*
- * CommitTsSetDefaultNodeId
- *
- * Set default nodeid for current backend.
- */
-void
-CommitTsSetDefaultNodeId(CommitTsNodeId nodeid)
-{
- default_node_id = nodeid;
-}
-
-/*
- * CommitTsGetDefaultNodeId
- *
- * Set default nodeid for current backend.
- */
-CommitTsNodeId
-CommitTsGetDefaultNodeId(void)
-{
- return default_node_id;
-}
+ RepNodeId nodeid);
/*
* TransactionTreeSetCommitTsData
@@ -156,7 +131,7 @@ CommitTsGetDefaultNodeId(void)
void
TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid, bool do_xlog)
+ RepNodeId nodeid, bool do_xlog)
{
int i;
TransactionId headxid;
@@ -234,7 +209,7 @@ TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
static void
SetXidCommitTsInPage(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz ts,
- CommitTsNodeId nodeid, int pageno)
+ RepNodeId nodeid, int pageno)
{
int slotno;
int i;
@@ -259,7 +234,7 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
- CommitTsNodeId nodeid, int slotno)
+ RepNodeId nodeid, int slotno)
{
int entryno = TransactionIdToCTsEntry(xid);
CommitTimestampEntry entry;
@@ -282,7 +257,7 @@ TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
*/
bool
TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
- CommitTsNodeId *nodeid)
+ RepNodeId *nodeid)
{
int pageno = TransactionIdToCTsPage(xid);
int entryno = TransactionIdToCTsEntry(xid);
@@ -322,7 +297,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (ts)
*ts = 0;
if (nodeid)
- *nodeid = InvalidCommitTsNodeId;
+ *nodeid = InvalidRepNodeId;
return false;
}
@@ -373,7 +348,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
* as NULL if not wanted.
*/
TransactionId
-GetLatestCommitTsData(TimestampTz *ts, CommitTsNodeId *nodeid)
+GetLatestCommitTsData(TimestampTz *ts, RepNodeId *nodeid)
{
TransactionId xid;
@@ -503,7 +478,7 @@ CommitTsShmemInit(void)
commitTsShared->xidLastCommit = InvalidTransactionId;
TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
- commitTsShared->dataLastCommit.nodeid = InvalidCommitTsNodeId;
+ commitTsShared->dataLastCommit.nodeid = InvalidRepNodeId;
}
else
Assert(found);
@@ -857,7 +832,7 @@ WriteTruncateXlogRec(int pageno)
static void
WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid)
+ RepNodeId nodeid)
{
xl_commit_ts_set record;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1495bb4..a9c5a73 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -40,8 +40,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/logical.h"
#include "replication/walsender.h"
#include "replication/syncrep.h"
+#include "replication/replication_identifier.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -1073,21 +1075,22 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
InvalidTransactionId /* plain commit */);
- }
- /*
- * We only need to log the commit timestamp separately if the node
- * identifier is a valid value; the commit record above already contains
- * the timestamp info otherwise, and will be used to load it.
- */
- if (markXidCommitted)
- {
- CommitTsNodeId node_id;
+ /* record plain commit ts if not replaying remote actions */
+ if (replication_origin_id == InvalidRepNodeId ||
+ replication_origin_id == DoNotReplicateRepNodeId)
+ replication_origin_timestamp = xactStopTimestamp;
+ else
+ AdvanceCachedReplicationIdentifier(replication_origin_lsn,
+ XactLastRecEnd);
- node_id = CommitTsGetDefaultNodeId();
+ /*
+ * We don't need to WAL log here, the commit record contains all the
+ * necessary information and will redo the SET action during replay.
+ */
TransactionTreeSetCommitTsData(xid, nchildren, children,
- xactStopTimestamp,
- node_id, node_id != InvalidCommitTsNodeId);
+ replication_origin_timestamp,
+ replication_origin_id, false);
}
/*
@@ -1176,9 +1179,11 @@ RecordTransactionCommit(void)
if (wrote_xlog && markXidCommitted)
SyncRepWaitForLSN(XactLastRecEnd);
+ /* remember end of last commit record */
+ XactLastCommitEnd = XactLastRecEnd;
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
-
cleanup:
/* Clean up local data */
if (rels)
@@ -4611,6 +4616,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_invals xl_invals;
xl_xact_twophase xl_twophase;
+ xl_xact_origin xl_origin;
uint8 info;
@@ -4668,6 +4674,15 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_twophase.xid = twophase_xid;
}
+ /* dump transaction origin information */
+ if (replication_origin_id != InvalidRepNodeId)
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replication_origin_lsn;
+ xl_origin.origin_timestamp = replication_origin_timestamp;
+ }
+
if (xl_xinfo.xinfo != 0)
info |= XLOG_XACT_HAS_INFO;
@@ -4709,6 +4724,12 @@ XactLogCommitRecord(TimestampTz commit_time,
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ /* we allow filtering by xacts */
+ XLogIncludeOrigin();
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -4806,10 +4827,12 @@ XactLogAbortRecord(TimestampTz abort_time,
static void
xact_redo_commit(xl_xact_parsed_commit *parsed,
TransactionId xid,
- XLogRecPtr lsn)
+ XLogRecPtr lsn,
+ RepNodeId origin_id)
{
TransactionId max_xid;
int i;
+ TimestampTz commit_time;
max_xid = TransactionIdLatest(xid, parsed->nsubxacts, parsed->subxacts);
@@ -4829,9 +4852,16 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
LWLockRelease(XidGenLock);
}
+ Assert(!!(parsed->xinfo & XACT_XINFO_HAS_ORIGIN) == (origin_id != InvalidRepNodeId));
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ commit_time = parsed->origin_timestamp;
+ else
+ commit_time = parsed->xact_time;
+
/* Set the transaction commit timestamp and metadata */
TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
- parsed->xact_time, InvalidCommitTsNodeId,
+ commit_time, origin_id,
false);
if (standbyState == STANDBY_DISABLED)
@@ -4892,6 +4922,14 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
StandbyReleaseLockTree(xid, 0, NULL);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ /* recover apply progress */
+ AdvanceReplicationIdentifier(origin_id,
+ parsed->origin_lsn,
+ lsn);
+ }
+
/* Make sure files supposed to be dropped are dropped */
if (parsed->nrels > 0)
{
@@ -5047,13 +5085,13 @@ xact_redo(XLogReaderState *record)
{
Assert(!TransactionIdIsValid(parsed.twophase_xid));
xact_redo_commit(&parsed, XLogRecGetXid(record),
- record->EndRecPtr);
+ record->EndRecPtr, XLogRecGetOrigin(record));
}
else
{
Assert(TransactionIdIsValid(parsed.twophase_xid));
xact_redo_commit(&parsed, parsed.twophase_xid,
- record->EndRecPtr);
+ record->EndRecPtr, XLogRecGetOrigin(record));
RemoveTwoPhaseFile(parsed.twophase_xid, false);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..10fab1e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "postmaster/startup.h"
#include "replication/logical.h"
#include "replication/slot.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
@@ -295,6 +296,7 @@ static TimeLineID curFileTLI;
static XLogRecPtr ProcLastRecPtr = InvalidXLogRecPtr;
XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
+XLogRecPtr XactLastCommitEnd = InvalidXLogRecPtr;
/*
* RedoRecPtr is this backend's local copy of the REDO record pointer
@@ -6212,6 +6214,11 @@ StartupXLOG(void)
StartupMultiXact();
/*
+ * Recover knowledge about replay progress of known replication partners.
+ */
+ StartupReplicationIdentifier();
+
+ /*
* Initialize unlogged LSN. On a clean shutdown, it's restored from the
* control file. On recovery, all unlogged relations are blown away, so
* the unlogged LSN counter can be reset too.
@@ -8394,6 +8401,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointBuffers(flags); /* performs all required fsyncs */
+ CheckPointReplicationIdentifier();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 618f879..cf56124 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -26,6 +26,7 @@
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "miscadmin.h"
+#include "replication/replication_identifier.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
#include "utils/memutils.h"
@@ -72,6 +73,9 @@ static XLogRecData *mainrdata_head;
static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
static uint32 mainrdata_len; /* total # of bytes in chain */
+/* Should te in-progress insertion log the origin */
+static bool include_origin = false;
+
/*
* These are used to hold the record header while constructing a record.
* 'hdr_scratch' is not a plain variable, but is palloc'd at initialization,
@@ -83,10 +87,12 @@ static uint32 mainrdata_len; /* total # of bytes in chain */
static XLogRecData hdr_rdt;
static char *hdr_scratch = NULL;
+#define SizeOfXlogOrigin (sizeof(RepNodeId) + sizeof(char))
+
#define HEADER_SCRATCH_SIZE \
(SizeOfXLogRecord + \
MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
- SizeOfXLogRecordDataHeaderLong)
+ SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
/*
* An array of XLogRecData structs, to hold registered data.
@@ -193,6 +199,7 @@ XLogResetInsertion(void)
max_registered_block_id = 0;
mainrdata_len = 0;
mainrdata_last = (XLogRecData *) &mainrdata_head;
+ include_origin = false;
begininsert_called = false;
}
@@ -375,6 +382,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
}
/*
+ * Should this record include the replication origin if one is set up?
+ */
+void
+XLogIncludeOrigin(void)
+{
+ Assert(begininsert_called);
+ include_origin = true;
+}
+
+/*
* Insert an XLOG record having the specified RMID and info bytes, with the
* body of the record being the data and buffer references registered earlier
* with XLogRegister* calls.
@@ -678,6 +695,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
scratch += sizeof(BlockNumber);
}
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+ /* followed by the record's origin, if any */
+ if (include_origin && replication_origin_id != InvalidRepNodeId)
+ {
+ *(scratch++) = XLR_BLOCK_ID_ORIGIN;
+ memcpy(scratch, &replication_origin_id, sizeof(replication_origin_id));
+ scratch += sizeof(replication_origin_id);
+ }
+#endif
+
/* followed by main data, if any */
if (mainrdata_len > 0)
{
@@ -723,6 +750,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
rechdr->xl_tot_len = total_len;
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
+#ifdef REPLICATION_IDENTIFIER_REUSE_PADDING
+ rechdr->xl_origin_id = replication_origin_id;
+#endif
rechdr->xl_prev = InvalidXLogRecPtr;
rechdr->xl_crc = rdata_crc;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 77be1b8..17880d7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -21,6 +21,7 @@
#include "access/xlogreader.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
+#include "replication/replication_identifier.h"
static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -975,6 +976,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
ResetDecoder(state);
state->decoded_record = record;
+ state->record_origin = InvalidRepNodeId;
ptr = (char *) record;
ptr += SizeOfXLogRecord;
@@ -1009,6 +1011,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
break; /* by convention, the main data fragment is
* always last */
}
+ else if (block_id == XLR_BLOCK_ID_ORIGIN)
+ {
+ COPY_HEADER_FIELD(&state->record_origin, sizeof(RepNodeId));
+ }
else if (block_id <= XLR_MAX_BLOCK_ID)
{
/* XLogRecordBlockHeader */
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index a403c64..5b04550 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -39,7 +39,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
pg_ts_config.h pg_ts_config_map.h pg_ts_dict.h \
pg_ts_parser.h pg_ts_template.h pg_extension.h \
pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
- pg_foreign_table.h pg_policy.h \
+ pg_foreign_table.h pg_policy.h pg_replication_identifier.h \
pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \
toasting.h indexing.h \
)
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index e9d3cdc..00c4393 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -32,6 +32,7 @@
#include "catalog/pg_namespace.h"
#include "catalog/pg_pltemplate.h"
#include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_replication_identifier.h"
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
@@ -224,7 +225,8 @@ IsSharedRelation(Oid relationId)
relationId == SharedDependRelationId ||
relationId == SharedSecLabelRelationId ||
relationId == TableSpaceRelationId ||
- relationId == DbRoleSettingRelationId)
+ relationId == DbRoleSettingRelationId ||
+ relationId == ReplicationIdentifierRelationId)
return true;
/* These are their indexes (see indexing.h) */
if (relationId == AuthIdRolnameIndexId ||
@@ -240,7 +242,9 @@ IsSharedRelation(Oid relationId)
relationId == SharedSecLabelObjectIndexId ||
relationId == TablespaceOidIndexId ||
relationId == TablespaceNameIndexId ||
- relationId == DbRoleSettingDatidRolidIndexId)
+ relationId == DbRoleSettingDatidRolidIndexId ||
+ relationId == ReplicationLocalIdentIndex ||
+ relationId == ReplicationExternalIdentIndex)
return true;
/* These are their toast tables and toast indexes (see toasting.h) */
if (relationId == PgShdescriptionToastTable ||
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4fd88f..3ecc16c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -777,6 +777,13 @@ CREATE VIEW pg_user_mappings AS
REVOKE ALL on pg_user_mapping FROM public;
+
+CREATE VIEW pg_replication_identifier_progress AS
+ SELECT *
+ FROM pg_get_replication_identifier_progress();
+
+REVOKE ALL ON pg_replication_identifier_progress FROM public;
+
--
-- We have a few function definitions in here, too.
-- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index 310a45c..95bcffb 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
-OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o snapbuild.o
+OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o replication_identifier.o \
+ snapbuild.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eb7293f..5003e59 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -40,6 +40,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "storage/standby.h"
@@ -422,6 +423,15 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
}
+static inline bool
+FilterByOrigin(LogicalDecodingContext *ctx, RepNodeId origin_id)
+{
+ if (ctx->callbacks.filter_by_origin_cb == NULL)
+ return false;
+
+ return filter_by_origin_cb_wrapper(ctx, origin_id);
+}
+
/*
* Consolidated commit record handling between the different form of commit
* records.
@@ -430,8 +440,17 @@ static void
DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid)
{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ XLogRecPtr commit_time = InvalidXLogRecPtr;
+ XLogRecPtr origin_id = InvalidRepNodeId;
int i;
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
@@ -452,12 +471,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* the reorderbuffer to forget the content of the (sub-)transactions
* if not.
*
- * There basically two reasons we might not be interested in this
+ * There can be several reasons we might not be interested in this
* transaction:
* 1) We might not be interested in decoding transactions up to this
* LSN. This can happen because we previously decoded it and now just
* are restarting or if we haven't assembled a consistent snapshot yet.
* 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -472,7 +492,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* ---
*/
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database))
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
{
for (i = 0; i < parsed->nsubxacts; i++)
{
@@ -492,7 +513,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
/* replay actions of all transaction + subtransactions in order */
ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- parsed->xact_time);
+ commit_time, origin_id, origin_lsn);
}
/*
@@ -537,8 +558,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -579,8 +605,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_UPDATE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -628,8 +659,13 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_DELETE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
@@ -673,6 +709,10 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (rnode.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
tupledata = XLogRecGetBlockData(r, 0, &tuplelen);
data = tupledata;
@@ -685,6 +725,8 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
+
memcpy(&change->data.tp.relnode, &rnode, sizeof(RelFileNode));
/*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 774ebbc..b60a1df 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -39,6 +39,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/replication_identifier.h"
#include "replication/snapbuild.h"
#include "storage/proc.h"
@@ -46,6 +47,10 @@
#include "utils/memutils.h"
+RepNodeId replication_origin_id = InvalidRepNodeId; /* assumed identity */
+XLogRecPtr replication_origin_lsn;
+TimestampTz replication_origin_timestamp;
+
/* data for errcontext callback */
typedef struct LogicalErrorCallbackState
{
@@ -720,6 +725,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+bool
+filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepNodeId origin_id)
+{
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "shutdown";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_by_origin_cb(ctx, origin_id);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
/*
* Set the required catalog xmin horizon for historic snapshots in the current
* replication slot.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index dc85583..e37a736 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1255,7 +1255,8 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
void
ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time)
+ TimestampTz commit_time,
+ RepNodeId origin_id, XLogRecPtr origin_lsn)
{
ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
@@ -1273,6 +1274,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
/* serialize the last bunch of changes if we need start earlier anyway */
if (txn->nentries_mem != txn->nentries)
diff --git a/src/backend/replication/logical/replication_identifier.c b/src/backend/replication/logical/replication_identifier.c
new file mode 100644
index 0000000..ef3f511
--- /dev/null
+++ b/src/backend/replication/logical/replication_identifier.c
@@ -0,0 +1,1296 @@
+/*-------------------------------------------------------------------------
+ *
+ * replication_identifier.c
+ * Logical Replication Node Identifier and replication progress persistency
+ * support.
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/replication_identifier.c
+ *
+ * NOTES
+ *
+ * This file provides the following:
+ * * Interface functions for naming nodes in a replication setup
+ * * A facility to efficiently store and persist replication progress in a
+ * efficient and durable manner.
+ *
+ * Replication identifiers consist out of a descriptive, user defined,
+ * external name and a short, thus space efficient, internal 2 byte one. This
+ * split exists because replication identifiers have to be stored in WAL and
+ * shared memory and long descriptors would be inefficient. For now only use
+ * 2 bytes for the internal id of a replication identifier as it seems
+ * unlikely that there soon will be more than 65k nodes in one replication
+ * setup; and using only two bytes allow us to be more space efficient.
+ *
+ * Replication progress is tracked in a shared memory table
+ * (ReplicationStates) that's dumped to disk every checkpoint. Entries
+ * ('slots') in this table are identified by the internal id. That's the case
+ * because it allows to increase replication progress during crash
+ * recovery. To allow doing so we store the original LSN (from the originating
+ * system) of a transaction in the commit record. That allows to recover the
+ * precise replayed state after crash recovery; without requiring synchronous
+ * commits. Allowing logical replication to use asynchronous commit is
+ * generally good for performance, but especially important as it allows a
+ * single threaded replay process to keep up with a source that has multiple
+ * backends generating changes concurrently. For efficiency and simplicity
+ * reasons a backend can setup a replication identifier as its origin (a
+ * "cached replication identifier") that's from then on the source of changes
+ * produced by the backend, until reset again.
+ *
+ * This infrastructure is intended to be used in cooperation with logical
+ * decoding. When replaying from a remote system the configured origin is
+ * provided to output plugins, allowing filtering and such.
+ *
+ *
+ * There are several levels of locking at work:
+ *
+ * * To create and drop replication identifiers a exclusive lock on
+ * pg_replication_slot is required for the duration. That allows us to
+ * safely and conflict free assign new identifiers using a dirty snapshot.
+ *
+ * * When creating a in-memory replication progress slot the
+ * ReplicationIdentifier LWLock has to be held exclusively; when iterating
+ * over the replication progress a shared lock has to be held, the same when
+ * advancing the replication progress of a individual backend that has not
+ * setup as the backend's cached replication identifier.
+ *
+ * * When manipulating or looking at the remote_lsn and local_lsn fields of a
+ * replication progress slot that slot's spinlock has to be held. That's
+ * primarily because we do not assume 8 byte writes (the LSN) is atomic on
+ * all our platforms, but it also simplifies memory ordering concerns
+ * between the remote and local lsn.
+ *
+ * ---------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <unistd.h>
+#include <sys/stat.h>
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+
+#include "catalog/indexing.h"
+
+#include "nodes/execnodes.h"
+
+#include "replication/replication_identifier.h"
+#include "replication/logical.h"
+
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/copydir.h"
+#include "storage/spin.h"
+
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/pg_lsn.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+/*
+ * Replay progress of a single remote node.
+ */
+typedef struct ReplicationState
+{
+ /*
+ * Local identifier for the remote node.
+ */
+ RepNodeId local_identifier;
+
+ /*
+ * Location of the latest commit from the remote side.
+ */
+ XLogRecPtr remote_lsn;
+
+ /*
+ * Remember the local lsn of the commit record so we can XLogFlush() to it
+ * during a checkpoint so we know the commit record actually is safe on
+ * disk.
+ */
+ XLogRecPtr local_lsn;
+
+ /*
+ * Slot is setup in backend?
+ */
+ pid_t acquired_by;
+
+ /*
+ * Spinlock protecting remote_lsn and local_lsn.
+ */
+ slock_t mutex;
+} ReplicationState;
+
+/*
+ * On disk version of ReplicationState.
+ */
+typedef struct ReplicationStateOnDisk
+{
+ RepNodeId local_identifier;
+ XLogRecPtr remote_lsn;
+} ReplicationStateOnDisk;
+
+
+/*
+ * Base address into a shared memory array of replication states of size
+ * max_replication_slots.
+ *
+ * XXX: Should we use a separate variable to size this rather than
+ * max_replication_slots?
+ */
+static ReplicationState *ReplicationStates;
+
+/*
+ * Backend-local, cached element from ReplicationStates for use in a backend
+ * replaying remote commits, so we don't have to search ReplicationStates for
+ * the backends current RepNodeId.
+ */
+static ReplicationState *cached_replication_state = NULL;
+
+/* Magic for on disk files. */
+#define REPLICATION_STATE_MAGIC ((uint32)0x1257DADE)
+
+static void
+CheckReplicationIdentifierPrerequisites(bool check_slots)
+{
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("only superusers can query or manipulate replication identifiers")));
+
+ if (check_slots && max_replication_slots == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot query or manipulate replication identifiers when max_replication_slots = 0")));
+
+}
+
+
+/* ---------------------------------------------------------------------------
+ * Functions for working with replication identifiers themselves.
+ * ---------------------------------------------------------------------------
+ */
+
+/*
+ * Check for a persistent repication identifier identified by the replication
+ * identifier's external name..
+ *
+ * Returns InvalidOid if the node isn't known yet.
+ */
+RepNodeId
+GetReplicationIdentifier(char *riname, bool missing_ok)
+{
+ Form_pg_replication_identifier ident;
+ Oid riident = InvalidOid;
+ HeapTuple tuple;
+ Datum riname_d;
+
+ riname_d = CStringGetTextDatum(riname);
+
+ tuple = SearchSysCache1(REPLIDREMOTE, riname_d);
+ if (HeapTupleIsValid(tuple))
+ {
+ ident = (Form_pg_replication_identifier) GETSTRUCT(tuple);
+ riident = ident->riident;
+ ReleaseSysCache(tuple);
+ }
+ else if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication identifier named %s",
+ riname);
+
+ return riident;
+}
+
+/*
+ * Create a persistent replication identifier.
+ *
+ * Needs to be called in a transaction.
+ */
+RepNodeId
+CreateReplicationIdentifier(char *riname)
+{
+ Oid riident;
+ HeapTuple tuple = NULL;
+ Relation rel;
+ Datum riname_d;
+ SnapshotData SnapshotDirty;
+ SysScanDesc scan;
+ ScanKeyData key;
+
+ riname_d = CStringGetTextDatum(riname);
+
+ Assert(IsTransactionState());
+
+ /*
+ * We need the numeric replication identifiers to be 16bit wide, so we
+ * cannot rely on the normal oid allocation. So we simply scan
+ * pg_replication_identifier for the first unused id. That's not
+ * particularly efficient, but this should be an fairly infrequent
+ * operation - we can easily spend a bit more code on this when it turns
+ * out it needs to be faster.
+ *
+ * We handle concurrency by taking an exclusive lock (allowing reads!)
+ * over the table for the duration of the search. Because we use a "dirty
+ * snapshot" we can read rows that other in-progress sessions have
+ * written, even though they would be invisible with normal snapshots. Due
+ * to the exclusive lock there's no danger that new rows can appear while
+ * we're checking.
+ */
+ InitDirtySnapshot(SnapshotDirty);
+
+ rel = heap_open(ReplicationIdentifierRelationId, ExclusiveLock);
+
+ for (riident = InvalidOid + 1; riident < UINT16_MAX; riident++)
+ {
+ bool nulls[Natts_pg_replication_identifier];
+ Datum values[Natts_pg_replication_identifier];
+ bool collides;
+ CHECK_FOR_INTERRUPTS();
+
+ ScanKeyInit(&key,
+ Anum_pg_replication_riident,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(riident));
+
+ scan = systable_beginscan(rel, ReplicationLocalIdentIndex,
+ true /* indexOK */,
+ &SnapshotDirty,
+ 1, &key);
+
+ collides = HeapTupleIsValid(systable_getnext(scan));
+
+ systable_endscan(scan);
+
+ if (!collides)
+ {
+ /*
+ * Ok, found an unused riident, insert the new row and do a CCI,
+ * so our callers can look it up if they want to.
+ */
+ memset(&nulls, 0, sizeof(nulls));
+
+ values[Anum_pg_replication_riident -1] = ObjectIdGetDatum(riident);
+ values[Anum_pg_replication_riname - 1] = riname_d;
+
+ tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
+ simple_heap_insert(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple);
+ CommandCounterIncrement();
+ break;
+ }
+ }
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+
+ if (tuple == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("no free replication id could be found")));
+
+ heap_freetuple(tuple);
+ return riident;
+}
+
+
+/*
+ * Create a persistent replication identifier.
+ *
+ * Needs to be called in a transaction.
+ */
+void
+DropReplicationIdentifier(RepNodeId riident)
+{
+ HeapTuple tuple = NULL;
+ Relation rel;
+ int i;
+
+ Assert(IsTransactionState());
+
+ rel = heap_open(ReplicationIdentifierRelationId, ExclusiveLock);
+
+ /* cleanup the slot state info */
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state = &ReplicationStates[i];
+
+ /* found our slot */
+ if (state->local_identifier == riident)
+ {
+ if (state->acquired_by != 0)
+ {
+ elog(ERROR, "cannot drop slot that is setup in backend %d",
+ state->acquired_by);
+ }
+ /* reset entry */
+ state->local_identifier = InvalidRepNodeId;
+ state->remote_lsn = InvalidXLogRecPtr;
+ state->local_lsn = InvalidXLogRecPtr;
+ break;
+ }
+ }
+ LWLockRelease(ReplicationIdentifierLock);
+
+ tuple = SearchSysCache1(REPLIDIDENT, ObjectIdGetDatum(riident));
+ simple_heap_delete(rel, &tuple->t_self);
+ ReleaseSysCache(tuple);
+
+ CommandCounterIncrement();
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+}
+
+
+/*
+ * Lookup pg_replication_identifier via riident and return the external name.
+ *
+ * The external name is palloc'd in the calling context.
+ *
+ * Returns true if the identifier is known, false otherwise.
+ */
+bool
+GetReplicationInfoByIdentifier(RepNodeId riident, bool missing_ok, char **riname)
+{
+ HeapTuple tuple;
+ Form_pg_replication_identifier ric;
+
+ Assert(OidIsValid((Oid) riident));
+ Assert(riident != InvalidRepNodeId);
+ Assert(riident != DoNotReplicateRepNodeId);
+
+ tuple = SearchSysCache1(REPLIDIDENT,
+ ObjectIdGetDatum((Oid) riident));
+
+ if (HeapTupleIsValid(tuple))
+ {
+ ric = (Form_pg_replication_identifier) GETSTRUCT(tuple);
+ *riname = text_to_cstring(&ric->riname);
+ ReleaseSysCache(tuple);
+
+ return true;
+ }
+ else
+ {
+ *riname = NULL;
+
+ if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication identifier id: %u",
+ riident);
+
+ return false;
+ }
+}
+
+
+/* ---------------------------------------------------------------------------
+ * Functions for handling replication progress.
+ * ---------------------------------------------------------------------------
+ */
+
+Size
+ReplicationIdentifierShmemSize(void)
+{
+ Size size = 0;
+
+ /*
+ * XXX: max_replication_slots is arguablethe wrong thing to use here, here
+ * we keep the replay state of *remote* transactions. But for now it seems
+ * sufficient to reuse it, lest we introduce a separate guc.
+ */
+ if (max_replication_slots == 0)
+ return size;
+
+ size = add_size(size,
+ mul_size(max_replication_slots, sizeof(ReplicationState)));
+ return size;
+}
+
+void
+ReplicationIdentifierShmemInit(void)
+{
+ bool found;
+
+ if (max_replication_slots == 0)
+ return;
+
+ ReplicationStates = (ReplicationState *)
+ ShmemInitStruct("ReplicationIdentifierState",
+ ReplicationIdentifierShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ int i;
+
+ MemSet(ReplicationStates, 0, ReplicationIdentifierShmemSize());
+
+ for (i = 0; i < max_replication_slots; i++)
+ SpinLockInit(&ReplicationStates[i].mutex);
+ }
+}
+
+/* ---------------------------------------------------------------------------
+ * Perform a checkpoint of replication identifier's progress with respect to
+ * the replayed remote_lsn. Make sure that all transactions we refer to in the
+ * checkpoint (local_lsn) are actually on-disk. This might not yet be the case
+ * if the transactions were originally committed asynchronously.
+ *
+ * We store checkpoints in the following format:
+ * +-------+------------------------+------------------+-----+--------+
+ * | MAGIC | ReplicationStateOnDisk | struct Replic... | ... | CRC32C | EOF
+ * +-------+------------------------+------------------+-----+--------+
+ *
+ * So its just the magic, followed by the statically sized
+ * ReplicationStateOnDisk structs. Note that the maximum number of
+ * ReplicationStates is determined by max_replication_slots.
+ * ---------------------------------------------------------------------------
+ */
+void
+CheckPointReplicationIdentifier(void)
+{
+ const char *tmppath = "pg_logical/replident_checkpoint.tmp";
+ const char *path = "pg_logical/replident_checkpoint";
+ int tmpfd;
+ int i;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ pg_crc32c crc;
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ /* make sure no old temp file is remaining */
+ if (unlink(tmppath) < 0 && errno != ENOENT)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ path)));
+
+ /*
+ * no other backend can perform this at the same time, we're protected by
+ * CheckpointLock.
+ */
+ tmpfd = OpenTransientFile((char *) tmppath,
+ O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
+ S_IRUSR | S_IWUSR);
+ if (tmpfd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m",
+ tmppath)));
+
+ /* write magic */
+ if ((write(tmpfd, &magic, sizeof(magic))) != sizeof(magic))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ /* prevent concurrent creations/drops */
+ LWLockAcquire(ReplicationIdentifierLock, LW_SHARED);
+
+ /* write actual data */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationStateOnDisk disk_state;
+ ReplicationState *curstate = &ReplicationStates[i];
+ XLogRecPtr local_lsn;
+
+ if (curstate->local_identifier == InvalidRepNodeId)
+ continue;
+
+ disk_state.local_identifier = curstate->local_identifier;
+
+ SpinLockAcquire(&curstate->mutex);
+ disk_state.remote_lsn = curstate->remote_lsn;
+ local_lsn = curstate->local_lsn;
+ SpinLockRelease(&curstate->mutex);
+
+ /* make sure we only write out a commit that's persistent */
+ XLogFlush(local_lsn);
+
+ if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
+ sizeof(disk_state))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+ }
+
+ LWLockRelease(ReplicationIdentifierLock);
+
+ /* write out the CRC */
+ FIN_CRC32C(crc);
+ if ((write(tmpfd, &crc, sizeof(crc))) != sizeof(crc))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+
+ /* fsync the temporary file */
+ if (pg_fsync(tmpfd) != 0)
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ tmppath)));
+ }
+
+ CloseTransientFile(tmpfd);
+
+ /* rename to permanent file, fsync file and directory */
+ if (rename(tmppath, path) != 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ tmppath, path)));
+ }
+
+ fsync_fname((char *) path, false);
+ fsync_fname("pg_logical", true);
+}
+
+/*
+ * Recover replication replay status from checkpoint data saved earlier by
+ * CheckPointReplicationIdentifier.
+ *
+ * This only needs to be called at startup and *not* during every checkpoint
+ * read during recovery (e.g. in HS or PITR from a base backup) afterwards. All
+ * state thereafter can be recovered by looking at commit records.
+ */
+void
+StartupReplicationIdentifier(void)
+{
+ const char *path = "pg_logical/replident_checkpoint";
+ int fd;
+ int readBytes;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ int last_state = 0;
+ pg_crc32c file_crc;
+ pg_crc32c crc;
+
+ /* don't want to overwrite already existing state */
+#ifdef USE_ASSERT_CHECKING
+ static bool already_started = false;
+ Assert(!already_started);
+ already_started = true;
+#endif
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ elog(LOG, "starting up replication identifiers");
+
+ fd = OpenTransientFile((char *) path, O_RDONLY | PG_BINARY, 0);
+
+ /*
+ * might have had max_replication_slots == 0 last run, or we just brought up a
+ * standby.
+ */
+ if (fd < 0 && errno == ENOENT)
+ return;
+ else if (fd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+
+ /* verify magic, thats written even if nothing was active */
+ readBytes = read(fd, &magic, sizeof(magic));
+ if (readBytes != sizeof(magic))
+ ereport(PANIC,
+ (errmsg("could not read file \"%s\": %m",
+ path)));
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ if (magic != REPLICATION_STATE_MAGIC)
+ ereport(PANIC,
+ (errmsg("replication checkpoint has wrong magic %u instead of %u",
+ magic, REPLICATION_STATE_MAGIC)));
+
+ /* we can skip locking here, no other access is possible */
+
+ /* recover individual states, until there are no more to be found */
+ while (true)
+ {
+ ReplicationStateOnDisk disk_state;
+
+ readBytes = read(fd, &disk_state, sizeof(disk_state));
+
+ /* no further data */
+ if (readBytes == sizeof(crc))
+ {
+ /* not pretty, but simple ... */
+ file_crc = *(pg_crc32c*) &disk_state;
+ break;
+ }
+
+ if (readBytes < 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ path)));
+ }
+
+ if (readBytes != sizeof(disk_state))
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": read %d of %zu",
+ path, readBytes, sizeof(disk_state))));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+
+ if (last_state == max_replication_slots)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state could be found, increase max_replication_slots")));
+
+ /* copy data to shared memory */
+ ReplicationStates[last_state].local_identifier = disk_state.local_identifier;
+ ReplicationStates[last_state].remote_lsn = disk_state.remote_lsn;
+ last_state++;
+
+ elog(LOG, "recovered replication state of node %u to %X/%X",
+ disk_state.local_identifier,
+ (uint32)(disk_state.remote_lsn >> 32),
+ (uint32)disk_state.remote_lsn);
+ }
+
+ /* now check checksum */
+ FIN_CRC32C(crc);
+ if (file_crc != crc)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("replication_slot_checkpoint has wrong checksum %u, expected %u",
+ crc, file_crc)));
+
+ CloseTransientFile(fd);
+}
+
+/*
+ * Tell the replication identifier machinery that a commit from 'node' that
+ * originated at the LSN remote_commit on the remote node was replayed
+ * successfully and that we don't need to do so again. In combination with
+ * setting up replication_origin_lsn and replication_origin_id that ensures we
+ * won't loose knowledge about that after a crash if the the transaction had a
+ * persistent effect (think of asynchronous commits).
+ *
+ * local_commit needs to be a local LSN of the commit so that we can make sure
+ * uppon a checkpoint that enough WAL has been persisted to disk.
+ *
+ * Needs to be called with a RowExclusiveLock on pg_replication_identifier,
+ * unless running in recovery.
+ */
+void
+AdvanceReplicationIdentifier(RepNodeId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit)
+{
+ int i;
+ int free_slot = -1;
+ ReplicationState *replication_state = NULL;
+
+ Assert(node != InvalidRepNodeId);
+
+ /* we don't track DoNotReplicateRepNodeId */
+ if (node == DoNotReplicateRepNodeId)
+ return;
+
+ /*
+ * XXX: should we restore into a hashtable and dump into shmem only after
+ * recovery finished?
+ */
+
+ /* Lock exclusively, as we may have to create a new table entry. */
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ /*
+ * Search for either an existing slot for that identifier or a free one we
+ * can use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &ReplicationStates[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->local_identifier == InvalidRepNodeId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->local_identifier != node)
+ continue;
+
+ if (curstate->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer %d is already active for pid %d",
+ curstate->local_identifier, curstate->acquired_by)));
+ }
+
+ /* ok, found slot */
+ replication_state = curstate;
+ break;
+ }
+
+ if (replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state slot could be found for replication identifier %u",
+ node),
+ errhint("Increase max_replication_slots and try again.")));
+ else if (replication_state == NULL)
+ {
+ /* initialize new slot */
+ replication_state = &ReplicationStates[free_slot];
+ Assert(replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(replication_state->local_lsn == InvalidXLogRecPtr);
+ replication_state->local_identifier = node;
+ }
+
+ Assert(replication_state->local_identifier != InvalidRepNodeId);
+
+ /*
+ * Due to - harmless - race conditions during a checkpoint we could see
+ * values here that are older than the ones we already have in
+ * memory. Don't overwrite those.
+ */
+ SpinLockAcquire(&replication_state->mutex);
+ if (replication_state->remote_lsn < remote_commit)
+ replication_state->remote_lsn = remote_commit;
+ if (replication_state->local_lsn < local_commit)
+ replication_state->local_lsn = local_commit;
+ SpinLockRelease(&replication_state->mutex);
+
+ /*
+ * Release *after* changing the LSNs, slot isn't acquired and thus could
+ * otherwise be dropped anytime.
+ */
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+
+XLogRecPtr
+ReplicationIdentifierProgress(RepNodeId node, bool flush)
+{
+ int i;
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+
+ /* prevent slots from being concurrently dropped */
+ LWLockAcquire(ReplicationIdentifierLock, LW_SHARED);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+
+ state = &ReplicationStates[i];
+
+ if (state->local_identifier == node)
+ {
+ SpinLockAcquire(&state->mutex);
+ remote_lsn = state->remote_lsn;
+ local_lsn = state->local_lsn;
+ SpinLockRelease(&state->mutex);
+ break;
+ }
+ }
+
+ LWLockRelease(ReplicationIdentifierLock);
+
+ if (flush && local_lsn != InvalidXLogRecPtr)
+ XLogFlush(local_lsn);
+
+ return remote_lsn;
+}
+
+/*
+ * Tear down a (possibly) cached replication identifier during process exit.
+ */
+static void
+ReplicationIdentifierExitCleanup(int code, Datum arg)
+{
+
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ if (cached_replication_state != NULL &&
+ cached_replication_state->acquired_by == MyProcPid)
+ {
+ cached_replication_state->acquired_by = 0;
+ cached_replication_state = NULL;
+ }
+
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+/*
+ * Setup a replication identifier in the shared memory struct if it doesn't
+ * already exists and cache access to the specific ReplicationSlot so the
+ * array doesn't have to be searched when calling
+ * AdvanceCachedReplicationIdentifier().
+ *
+ * Obviously only one such cached identifier can exist per process and the
+ * current cached value can only be set again after the previous value is torn
+ * down with TeardownCachedReplicationIdentifier().
+ */
+void
+SetupCachedReplicationIdentifier(RepNodeId node)
+{
+ static bool registered_cleanup;
+ int i;
+ int free_slot = -1;
+
+ if (!registered_cleanup)
+ {
+ on_shmem_exit(ReplicationIdentifierExitCleanup, 0);
+ registered_cleanup = true;
+ }
+
+ Assert(max_replication_slots > 0);
+
+ if (cached_replication_state != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot setup replication origin when one is already setup")));
+
+ /* Lock exclusively, as we may have to create a new table entry. */
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ /*
+ * Search for either an existing slot for that identifier or a free one we
+ * can use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &ReplicationStates[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->local_identifier == InvalidRepNodeId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->local_identifier != node)
+ continue;
+
+ else if (curstate->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer %d is already active for pid %d",
+ curstate->local_identifier, curstate->acquired_by)));
+ }
+
+ /* ok, found slot */
+ cached_replication_state = curstate;
+ }
+
+
+ if (cached_replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state slot could be found for replication identifier %u",
+ node),
+ errhint("Increase max_replication_slots and try again.")));
+ else if (cached_replication_state == NULL)
+ {
+ /* initialize new slot */
+ cached_replication_state = &ReplicationStates[free_slot];
+ Assert(cached_replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(cached_replication_state->local_lsn == InvalidXLogRecPtr);
+ cached_replication_state->local_identifier = node;
+ }
+
+
+ Assert(cached_replication_state->local_identifier != InvalidRepNodeId);
+
+ cached_replication_state->acquired_by = MyProcPid;
+
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+/*
+ * Make currently cached replication identifier unavailable so a new one can
+ * be setup with SetupCachedReplicationIdentifier().
+ *
+ * This function may only be called if a previous identifier was setup with
+ * SetupCachedReplicationIdentifier().
+ */
+void
+TeardownCachedReplicationIdentifier(void)
+{
+ Assert(max_replication_slots != 0);
+
+ if (cached_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("no replication identifier is set up")));
+
+ LWLockAcquire(ReplicationIdentifierLock, LW_EXCLUSIVE);
+
+ cached_replication_state->acquired_by = 0;
+ cached_replication_state = NULL;
+
+ LWLockRelease(ReplicationIdentifierLock);
+}
+
+/*
+ * Do the same work AdvanceReplicationIdentifier() does, just on a pre-cached
+ * identifier. This is noticeably cheaper if you only ever work on a single
+ * replication identifier.
+ */
+void
+AdvanceCachedReplicationIdentifier(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit)
+{
+ Assert(cached_replication_state != NULL);
+ Assert(cached_replication_state->local_identifier != InvalidRepNodeId);
+
+ SpinLockAcquire(&cached_replication_state->mutex);
+ if (cached_replication_state->local_lsn < local_commit)
+ cached_replication_state->local_lsn = local_commit;
+ if (cached_replication_state->remote_lsn < remote_commit)
+ cached_replication_state->remote_lsn = remote_commit;
+ SpinLockRelease(&cached_replication_state->mutex);
+}
+
+/*
+ * Ask the machinery about the point up to which we successfully replayed
+ * changes from a already setup & cached replication identifier.
+ */
+XLogRecPtr
+CachedReplicationIdentifierProgress(void)
+{
+ XLogRecPtr remote_lsn;
+
+ Assert(cached_replication_state != NULL);
+
+ SpinLockAcquire(&cached_replication_state->mutex);
+ remote_lsn = cached_replication_state->remote_lsn;
+ SpinLockRelease(&cached_replication_state->mutex);
+
+ return remote_lsn;
+}
+
+
+
+/* ---------------------------------------------------------------------------
+ * SQL functions for working with replication identifiers.
+ *
+ * These mostly should be fairly short wrappers around more generic functions.
+ * ---------------------------------------------------------------------------
+ */
+
+/*
+ * Return the internal replication identifier for the passed in external one.
+ */
+Datum
+pg_replication_identifier_get(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ riident = GetReplicationIdentifier(name, true);
+
+ pfree(name);
+
+ if (OidIsValid(riident))
+ PG_RETURN_OID(riident);
+ PG_RETURN_NULL();
+}
+
+/*
+ * Create a replication identifier with the passed in name, and return the
+ * assigned internal identifier.
+ */
+Datum
+pg_replication_identifier_create(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ riident = CreateReplicationIdentifier(name);
+
+ pfree(name);
+
+ PG_RETURN_OID(riident);
+}
+
+/*
+ * Setup a cached replication identifier in the current session.
+ */
+Datum
+pg_replication_identifier_setup_replaying_from(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId origin;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ origin = GetReplicationIdentifier(name, false);
+ SetupCachedReplicationIdentifier(origin);
+
+ replication_origin_id = origin;
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_is_replaying(PG_FUNCTION_ARGS)
+{
+ CheckReplicationIdentifierPrerequisites(false);
+
+ PG_RETURN_BOOL(replication_origin_id != InvalidRepNodeId);
+}
+
+Datum
+pg_replication_identifier_reset_replaying_from(PG_FUNCTION_ARGS)
+{
+ CheckReplicationIdentifierPrerequisites(true);
+
+ TeardownCachedReplicationIdentifier();
+
+ replication_origin_id = InvalidRepNodeId;
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_setup_tx_origin(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr location = PG_GETARG_LSN(0);
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ if (cached_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("need to setup the origin id first")));
+
+ replication_origin_lsn = location;
+ replication_origin_timestamp = PG_GETARG_TIMESTAMPTZ(1);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_advance(PG_FUNCTION_ARGS)
+{
+ text *name = PG_GETARG_TEXT_P(0);
+ XLogRecPtr remote_commit = PG_GETARG_LSN(1);
+ RepNodeId node;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ /* lock to prevent the replication identifier from vanishing */
+ LockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ node = GetReplicationIdentifier(text_to_cstring(name), false);
+
+ /*
+ * Can't sensibly pass a local commit to be flushed at checkpoint - this
+ * xact hasn't committed yet. This is why this function should be used to
+ * set up the intial replication state, but not for replay.
+ */
+ AdvanceReplicationIdentifier(node, remote_commit, InvalidXLogRecPtr);
+
+ UnlockRelationOid(ReplicationIdentifierRelationId, RowExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_identifier_drop(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepNodeId riident;
+
+ CheckReplicationIdentifierPrerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+
+ riident = GetReplicationIdentifier(name, false);
+ Assert(OidIsValid(riident));
+
+ DropReplicationIdentifier(riident);
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Return the replication progress for an individual replication identifier.
+ *
+ * If 'flush' is set to true it is ensured that the returned value corresponds
+ * to a local transaction that has been flushed. this is useful if asychronous
+ * commits are used when replaying replicated transactions.
+ */
+Datum
+pg_replication_identifier_progress(PG_FUNCTION_ARGS)
+{
+ char *name;
+ bool flush;
+ RepNodeId riident;
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+
+ CheckReplicationIdentifierPrerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ flush = PG_GETARG_BOOL(1);
+
+ riident = GetReplicationIdentifier(name, false);
+ Assert(OidIsValid(riident));
+
+ remote_lsn = ReplicationIdentifierProgress(riident, flush);
+
+ if (remote_lsn == InvalidXLogRecPtr)
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(remote_lsn);
+}
+
+
+Datum
+pg_get_replication_identifier_progress(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ int i;
+#define REPLICATION_IDENTIFIER_PROGRESS_COLS 4
+
+ /* we we want to return 0 rows if slot is set to zero */
+ CheckReplicationIdentifierPrerequisites(false);
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (tupdesc->natts != REPLICATION_IDENTIFIER_PROGRESS_COLS)
+ elog(ERROR, "wrong function definition");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* prevent slots from being concurrently dropped */
+ LWLockAcquire(ReplicationIdentifierLock, LW_SHARED);
+
+ /*
+ * Iterate through all possible ReplicationStates, display if they are
+ * filled. Note that we do not take any locks, so slightly corrupted/out
+ * of date values are a possibility.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+ Datum values[REPLICATION_IDENTIFIER_PROGRESS_COLS];
+ bool nulls[REPLICATION_IDENTIFIER_PROGRESS_COLS];
+ char *riname;
+
+ state = &ReplicationStates[i];
+
+ /* unused slot, nothing to display */
+ if (state->local_identifier == InvalidRepNodeId)
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[ 0] = ObjectIdGetDatum(state->local_identifier);
+
+ /*
+ * We're not preventing the identifier to be dropped concurrently, so
+ * silently accept that it might be gone.
+ */
+ if (!GetReplicationInfoByIdentifier(state->local_identifier, true,
+ &riname))
+ continue;
+
+ values[ 1] = CStringGetTextDatum(riname);
+
+ SpinLockAcquire(&state->mutex);
+
+ values[ 2] = LSNGetDatum(state->remote_lsn);
+
+ values[ 3] = LSNGetDatum(state->local_lsn);
+
+ SpinLockRelease(&state->mutex);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ tuplestore_donestoring(tupstore);
+
+ LWLockRelease(ReplicationIdentifierLock);
+
+#undef REPLICATION_IDENTIFIER_PROGRESS_COLS
+
+ return (Datum) 0;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 16b9808..e927698 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "replication/slot.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/replication_identifier.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
@@ -132,6 +133,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, CheckpointerShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
+ size = add_size(size, ReplicationIdentifierShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -238,6 +240,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
CheckpointerShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
+ ReplicationIdentifierShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index bd27168..fdccb95 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -54,6 +54,7 @@
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
+#include "catalog/pg_replication_identifier.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_tablespace.h"
#include "catalog/pg_ts_config.h"
@@ -620,6 +621,28 @@ static const struct cachedesc cacheinfo[] = {
},
128
},
+ {ReplicationIdentifierRelationId, /* REPLIDIDENT */
+ ReplicationLocalIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_riident,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
+ {ReplicationIdentifierRelationId, /* REPLIDREMOTE */
+ ReplicationExternalIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_riname,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
{RewriteRelationId, /* RULERELNAME */
RewriteRelRulenameIndexId,
2,
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index a0805d8..c9a7e7a 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -56,6 +56,8 @@
#include "common/restricted_token.h"
#include "storage/large_object.h"
#include "pg_getopt.h"
+#include "replication/logical.h"
+#include "replication/replication_identifier.h"
static ControlFileData ControlFile; /* pg_control values */
@@ -1091,6 +1093,7 @@ WriteEmptyXLOG(void)
record->xl_tot_len = SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint);
record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
record->xl_rmid = RM_XLOG_ID;
+
recptr += SizeOfXLogRecord;
*(recptr++) = XLR_BLOCK_ID_DATA_SHORT;
*(recptr++) = sizeof(CheckPoint);
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 93d1217..578513d 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -13,6 +13,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
+#include "replication/replication_identifier.h"
#include "utils/guc.h"
@@ -21,18 +22,13 @@ extern PGDLLIMPORT bool track_commit_timestamp;
extern bool check_track_commit_timestamp(bool *newval, void **extra,
GucSource source);
-typedef uint32 CommitTsNodeId;
-#define InvalidCommitTsNodeId 0
-
-extern void CommitTsSetDefaultNodeId(CommitTsNodeId nodeid);
-extern CommitTsNodeId CommitTsGetDefaultNodeId(void);
extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid, bool do_xlog);
+ RepNodeId nodeid, bool do_xlog);
extern bool TransactionIdGetCommitTsData(TransactionId xid,
- TimestampTz *ts, CommitTsNodeId *nodeid);
+ TimestampTz *ts, RepNodeId *nodeid);
extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
- CommitTsNodeId *nodeid);
+ RepNodeId *nodeid);
extern Size CommitTsShmemBuffers(void);
extern Size CommitTsShmemSize(void);
@@ -58,7 +54,7 @@ extern void AdvanceOldestCommitTs(TransactionId oldestXact);
typedef struct xl_commit_ts_set
{
TimestampTz timestamp;
- CommitTsNodeId nodeid;
+ RepNodeId nodeid;
TransactionId mainxid;
/* subxact Xids follow */
} xl_commit_ts_set;
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fdf3ea3..9e78403 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -131,6 +131,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_RELFILENODES (1U << 2)
#define XACT_XINFO_HAS_INVALS (1U << 3)
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
+#define XACT_XINFO_HAS_ORIGIN (1U << 5)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -217,6 +218,12 @@ typedef struct xl_xact_twophase
} xl_xact_twophase;
#define MinSizeOfXactInvals offsetof(xl_xact_invals, msgs)
+typedef struct xl_xact_origin
+{
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_origin;
+
typedef struct xl_xact_commit
{
TimestampTz xact_time; /* time of commit */
@@ -227,6 +234,7 @@ typedef struct xl_xact_commit
/* xl_xact_relfilenodes follows if XINFO_HAS_RELFILENODES */
/* xl_xact_invals follows if XINFO_HAS_INVALS */
/* xl_xact_twophase follows if XINFO_HAS_TWOPHASE */
+ /* xl_xact_origin follows if XINFO_HAS_ORIGIN */
} xl_xact_commit;
#define MinSizeOfXactCommit (offsetof(xl_xact_commit, xact_time) + sizeof(TimestampTz))
@@ -267,6 +275,9 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
typedef struct xl_xact_parsed_abort
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2b1f423..f08b676 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -85,6 +85,7 @@ typedef enum
} RecoveryTargetType;
extern XLogRecPtr XactLastRecEnd;
+extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
extern bool reachedConsistency;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index deca1de..75cf435 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD083 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD085 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 6638c1d..bd8dd70 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -45,6 +45,12 @@ typedef uint64 XLogSegNo;
typedef uint32 TimeLineID;
/*
+ * Denotes the node on which the action causing a wal record to be logged
+ * originated on.
+ */
+typedef uint16 RepNodeId;
+
+/*
* Because O_DIRECT bypasses the kernel buffers, and because we never
* read those buffers except during crash recovery or if wal_level != minimal,
* it is a win to use it in all cases where we sync on each write(). We could
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 6864c95..ac60929 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -39,6 +39,7 @@
/* prototypes for public functions in xloginsert.c: */
extern void XLogBeginInsert(void);
+extern void XLogIncludeOrigin(void);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
extern void XLogRegisterData(char *data, int len);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 609bfe3..efebbf0 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -127,6 +127,8 @@ struct XLogReaderState
uint32 main_data_len; /* main data portion's length */
uint32 main_data_bufsz; /* allocated size of the buffer */
+ RepNodeId record_origin;
+
/* information about blocks referenced by the record. */
DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
@@ -186,6 +188,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
#define XLogRecGetData(decoder) ((decoder)->main_data)
#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index b487ae0..bf6fd41 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -212,5 +212,8 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
+#ifndef REPLICATION_IDENTIFIER_REUSE_PADDING
+#define XLR_BLOCK_ID_ORIGIN 253
+#endif
#endif /* XLOGRECORD_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 8ecd5fd..dce7eaf 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201504121
+#define CATALOG_VERSION_NO 201504122
#endif
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a680229..405528d 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -305,6 +305,12 @@ DECLARE_UNIQUE_INDEX(pg_policy_oid_index, 3257, on pg_policy using btree(oid oid
DECLARE_UNIQUE_INDEX(pg_policy_polrelid_polname_index, 3258, on pg_policy using btree(polrelid oid_ops, polname name_ops));
#define PolicyPolrelidPolnameIndexId 3258
+DECLARE_UNIQUE_INDEX(pg_replication_identifier_riiident_index, 6001, on pg_replication_identifier using btree(riident oid_ops));
+#define ReplicationLocalIdentIndex 6001
+
+DECLARE_UNIQUE_INDEX(pg_replication_identifier_riname_index, 6002, on pg_replication_identifier using btree(riname varchar_pattern_ops));
+#define ReplicationExternalIdentIndex 6002
+
/* last step of initialization script: build the indexes declared above */
BUILD_INDICES
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 8469c82..575ca36 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5201,6 +5201,36 @@ DESCR("for use by pg_upgrade");
DATA(insert OID = 3591 ( binary_upgrade_create_empty_extension PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "25 25 16 25 1028 1009 1009" _null_ _null_ _null_ _null_ binary_upgrade_create_empty_extension _null_ _null_ _null_ ));
DESCR("for use by pg_upgrade");
+/* replication_identifier.h */
+DATA(insert OID = 6003 ( pg_replication_identifier_create PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_create _null_ _null_ _null_ ));
+DESCR("create local replication identifier for the passed external one");
+
+DATA(insert OID = 6004 ( pg_replication_identifier_drop PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_drop _null_ _null_ _null_ ));
+DESCR("drop existing replication identifier");
+
+DATA(insert OID = 6005 ( pg_replication_identifier_get PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_get _null_ _null_ _null_ ));
+DESCR("translate the external node identifier to a local one");
+
+DATA(insert OID = 6006 ( pg_replication_identifier_setup_replaying_from PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_identifier_setup_replaying_from _null_ _null_ _null_ ));
+DESCR("setup from which node we are replaying transactions from currently");
+
+DATA(insert OID = 6007 ( pg_replication_identifier_reset_replaying_from PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_replication_identifier_reset_replaying_from _null_ _null_ _null_ ));
+DESCR("teardown configured replication identity");
+
+DATA(insert OID = 6008 ( pg_replication_identifier_setup_tx_origin PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ pg_replication_identifier_setup_tx_origin _null_ _null_ _null_ ));
+DESCR("setup transaction timestamp and origin lsn");
+
+DATA(insert OID = 6009 ( pg_replication_identifier_is_replaying PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_replication_identifier_is_replaying _null_ _null_ _null_ ));
+DESCR("is a replication identifier setup");
+
+DATA(insert OID = 6010 ( pg_replication_identifier_advance PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "25 3220" _null_ _null_ _null_ _null_ pg_replication_identifier_advance _null_ _null_ _null_ ));
+DESCR("advance replication itentifier to specific location");
+
+DATA(insert OID = 6011 ( pg_replication_identifier_progress PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 3220 "25 16" _null_ _null_ _null_ _null_ pg_replication_identifier_progress _null_ _null_ _null_ ));
+DESCR("get an individualreplication identifier's replication progress");
+
+DATA(insert OID = 6012 ( pg_get_replication_identifier_progress PGNSP PGUID 12 1 100 0 0 f f f f f t v 0 0 2249 "" "{26,25,3220,3220}" "{o,o,o,o}" "{local_id, external_id, remote_lsn, local_lsn}" _null_ pg_get_replication_identifier_progress _null_ _null_ _null_ ));
+DESCR("get progress for all replication identifiers");
/*
* Symbolic values for provolatile column: these indicate whether the result
diff --git a/src/include/catalog/pg_replication_identifier.h b/src/include/catalog/pg_replication_identifier.h
new file mode 100644
index 0000000..d72c839
--- /dev/null
+++ b/src/include/catalog/pg_replication_identifier.h
@@ -0,0 +1,74 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_replication_identifier.h
+ * Persistent Replication Node Identifiers
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/pg_replication_identifier.h
+ *
+ * NOTES
+ * the genbki.pl script reads this file and generates .bki
+ * information from the DATA() statements.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_REPLICATION_IDENTIFIER_H
+#define PG_REPLICATION_IDENTIFIER_H
+
+#include "catalog/genbki.h"
+#include "access/xlogdefs.h"
+
+/* ----------------
+ * pg_replication_identifier. cpp turns this into
+ * typedef struct FormData_pg_replication_identifier
+ * ----------------
+ */
+#define ReplicationIdentifierRelationId 6000
+
+CATALOG(pg_replication_identifier,6000) BKI_SHARED_RELATION BKI_WITHOUT_OIDS
+{
+ /*
+ * Locally known identifier that get included into WAL.
+ *
+ * This should never leave the system.
+ *
+ * Needs to fit into a uint16, so we don't waste too much space in WAL
+ * records. For this reason we don't use a normal Oid column here, since
+ * we need to handle allocation of new values manually.
+ */
+ Oid riident;
+
+ /*
+ * Variable-length fields start here, but we allow direct access to
+ * riname.
+ */
+
+ /* external, free-format, identifier */
+ text riname BKI_FORCE_NOT_NULL;
+#ifdef CATALOG_VARLEN /* further variable-length fields */
+#endif
+} FormData_pg_replication_identifier;
+
+/* ----------------
+ * Form_pg_extension corresponds to a pointer to a tuple with
+ * the format of pg_extension relation.
+ * ----------------
+ */
+typedef FormData_pg_replication_identifier *Form_pg_replication_identifier;
+
+/* ----------------
+ * compiler constants for pg_replication_identifier
+ * ----------------
+ */
+#define Natts_pg_replication_identifier 2
+#define Anum_pg_replication_riident 1
+#define Anum_pg_replication_riname 2
+
+/* ----------------
+ * pg_replication_identifier has no initial contents
+ * ----------------
+ */
+
+#endif /* PG_REPLICTION_IDENTIFIER_H */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index cce4394..f78fb8f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -97,4 +97,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepNodeId origin_id);
+
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 0935c1b..26095b1 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -74,6 +74,13 @@ typedef void (*LogicalDecodeCommitCB) (
XLogRecPtr commit_lsn);
/*
+ * Filter changes by origin.
+ */
+typedef bool (*LogicalDecodeFilterByOriginCB) (
+ struct LogicalDecodingContext *,
+ RepNodeId origin_id);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (
@@ -89,6 +96,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f1e0f57..0c13fca 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -68,6 +68,8 @@ typedef struct ReorderBufferChange
/* The type of change. */
enum ReorderBufferChangeType action;
+ RepNodeId origin_id;
+
/*
* Context data for the change, which part of the union is valid depends
* on action/action_internal.
@@ -166,6 +168,10 @@ typedef struct ReorderBufferTXN
*/
XLogRecPtr restart_decoding_lsn;
+ /* origin of the change that caused this transaction */
+ RepNodeId origin_id;
+ XLogRecPtr origin_lsn;
+
/*
* Commit time, only known when we read the actual commit record.
*/
@@ -339,7 +345,7 @@ void ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
void ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time);
+ TimestampTz commit_time, RepNodeId origin_id, XLogRecPtr origin_lsn);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
diff --git a/src/include/replication/replication_identifier.h b/src/include/replication/replication_identifier.h
new file mode 100644
index 0000000..47cc032
--- /dev/null
+++ b/src/include/replication/replication_identifier.h
@@ -0,0 +1,62 @@
+/*-------------------------------------------------------------------------
+ * replication_identifier.h
+ * Exports from replication/logical/replication_identifier.c
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/replication_identifier.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef REPLICATION_IDENTIFIER_H
+#define REPLICATION_IDENTIFIER_H
+
+#include "access/xlogdefs.h"
+#include "catalog/pg_replication_identifier.h"
+#include "replication/logical.h"
+
+#define InvalidRepNodeId 0
+#define DoNotReplicateRepNodeId UINT16_MAX
+
+extern PGDLLIMPORT RepNodeId replication_origin_id;
+extern PGDLLIMPORT XLogRecPtr replication_origin_lsn;
+extern PGDLLIMPORT TimestampTz replication_origin_timestamp;
+
+/* API for querying & manipulating replication identifiers */
+extern RepNodeId GetReplicationIdentifier(char *name, bool missing_ok);
+extern RepNodeId CreateReplicationIdentifier(char *name);
+extern bool GetReplicationInfoByIdentifier(RepNodeId riident, bool missing_ok,
+ char **riname);
+extern void DropReplicationIdentifier(RepNodeId riident);
+
+/* API for querying & manipulating replication progress */
+extern void AdvanceReplicationIdentifier(RepNodeId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern XLogRecPtr ReplicationIdentifierProgress(RepNodeId node, bool flush);
+extern void AdvanceCachedReplicationIdentifier(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern void SetupCachedReplicationIdentifier(RepNodeId node);
+extern void TeardownCachedReplicationIdentifier(void);
+extern XLogRecPtr CachedReplicationIdentifierProgress(void);
+
+/* crash recovery support */
+extern void CheckPointReplicationIdentifier(void);
+extern void StartupReplicationIdentifier(void);
+
+/* internals */
+extern Size ReplicationIdentifierShmemSize(void);
+extern void ReplicationIdentifierShmemInit(void);
+
+/* SQL callable functions */
+extern Datum pg_replication_identifier_get(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_create(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_drop(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_setup_replaying_from(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_reset_replaying_from(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_is_replaying(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_setup_tx_origin(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_progress(PG_FUNCTION_ARGS);
+extern Datum pg_get_replication_identifier_progress(PG_FUNCTION_ARGS);
+extern Datum pg_replication_identifier_advance(PG_FUNCTION_ARGS);
+
+#endif
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e3c2efc..919708b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -134,8 +134,9 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
#define ReplicationSlotControlLock (&MainLWLockArray[37].lock)
#define CommitTsControlLock (&MainLWLockArray[38].lock)
#define CommitTsLock (&MainLWLockArray[39].lock)
+#define ReplicationIdentifierLock (&MainLWLockArray[40].lock)
-#define NUM_INDIVIDUAL_LWLOCKS 40
+#define NUM_INDIVIDUAL_LWLOCKS 41
/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index ba0b090..d7be45a 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,8 @@ enum SysCacheIdentifier
RANGETYPE,
RELNAMENSP,
RELOID,
+ REPLIDIDENT,
+ REPLIDREMOTE,
RULERELNAME,
STATRELATTINH,
TABLESPACEOID,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 71fa44a..5030f9a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1390,6 +1390,11 @@ pg_prepared_xacts| SELECT p.transaction,
FROM ((pg_prepared_xact() p(transaction, gid, prepared, ownerid, dbid)
LEFT JOIN pg_authid u ON ((p.ownerid = u.oid)))
LEFT JOIN pg_database d ON ((p.dbid = d.oid)));
+pg_replication_identifier_progress| SELECT pg_get_replication_identifier_progress.local_id,
+ pg_get_replication_identifier_progress.external_id,
+ pg_get_replication_identifier_progress.remote_lsn,
+ pg_get_replication_identifier_progress.local_lsn
+ FROM pg_get_replication_identifier_progress() pg_get_replication_identifier_progress(local_id, external_id, remote_lsn, local_lsn);
pg_replication_slots| SELECT l.slot_name,
l.plugin,
l.slot_type,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index c7be273..400cba3 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -121,6 +121,7 @@ pg_pltemplate|t
pg_policy|t
pg_proc|t
pg_range|t
+pg_replication_identifier|t
pg_rewrite|t
pg_seclabel|t
pg_shdepend|t
--
2.4.0.rc2.1.g3d6bc9a
On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote:
Hrmpf. Says the person that used a lot of padding, without much
discussion, for the WAL level infrastructure making pg_rewind more
maintainable.
Sounds bad. What padding are we talking about?
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At 2015-04-17 10:54:51 +0200, andres@anarazel.de wrote:
(The FPI percentage display above is arguably borked. Interesting.)
Sorry for the trouble. Patch attached.
-- Abhijit
Attachments:
0001-Don-t-divide-by-zero-when-calculating-percentages.patchtext/x-diff; charset=us-asciiDownload
>From 1e5c5d5948030e8ff6ccdd2309a97fb1e116d8e2 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 17 Apr 2015 14:45:41 +0530
Subject: Don't divide by zero when calculating percentages
---
contrib/pg_xlogdump/pg_xlogdump.c | 53 ++++++++++++++++++++++++++++-----------
1 file changed, 38 insertions(+), 15 deletions(-)
diff --git a/contrib/pg_xlogdump/pg_xlogdump.c b/contrib/pg_xlogdump/pg_xlogdump.c
index 4f297e9..3f61c32 100644
--- a/contrib/pg_xlogdump/pg_xlogdump.c
+++ b/contrib/pg_xlogdump/pg_xlogdump.c
@@ -489,18 +489,36 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
*/
static void
XLogDumpStatsRow(const char *name,
- uint64 n, double n_pct,
- uint64 rec_len, double rec_len_pct,
- uint64 fpi_len, double fpi_len_pct,
- uint64 total_len, double total_len_pct)
+ uint64 n, uint64 total_count,
+ uint64 rec_len, uint64 total_rec_len,
+ uint64 fpi_len, uint64 total_fpi_len,
+ uint64 tot_len, uint64 total_len)
{
+ double n_pct, rec_len_pct, fpi_len_pct, tot_len_pct;
+
+ n_pct = 0;
+ if (total_count != 0)
+ n_pct = 100 * (double) n / total_count;
+
+ rec_len_pct = 0;
+ if (total_rec_len != 0)
+ rec_len_pct = 100 * (double) rec_len / total_rec_len;
+
+ fpi_len_pct = 0;
+ if (total_fpi_len != 0)
+ fpi_len_pct = 100 * (double) fpi_len / total_fpi_len;
+
+ tot_len_pct = 0;
+ if (total_len != 0)
+ tot_len_pct = 100 * (double) tot_len / total_len;
+
printf("%-27s "
"%20" INT64_MODIFIER "u (%6.02f) "
"%20" INT64_MODIFIER "u (%6.02f) "
"%20" INT64_MODIFIER "u (%6.02f) "
"%20" INT64_MODIFIER "u (%6.02f)\n",
name, n, n_pct, rec_len, rec_len_pct, fpi_len, fpi_len_pct,
- total_len, total_len_pct);
+ tot_len, tot_len_pct);
}
@@ -515,6 +533,7 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
uint64 total_rec_len = 0;
uint64 total_fpi_len = 0;
uint64 total_len = 0;
+ double rec_len_pct, fpi_len_pct;
/* ---
* Make a first pass to calculate column totals:
@@ -557,10 +576,8 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
tot_len = rec_len + fpi_len;
XLogDumpStatsRow(desc->rm_name,
- count, 100 * (double) count / total_count,
- rec_len, 100 * (double) rec_len / total_rec_len,
- fpi_len, 100 * (double) fpi_len / total_fpi_len,
- tot_len, 100 * (double) tot_len / total_len);
+ count, total_count, rec_len, total_rec_len,
+ fpi_len, total_fpi_len, tot_len, total_len);
}
else
{
@@ -583,10 +600,8 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
id = psprintf("UNKNOWN (%x)", rj << 4);
XLogDumpStatsRow(psprintf("%s/%s", desc->rm_name, id),
- count, 100 * (double) count / total_count,
- rec_len, 100 * (double) rec_len / total_rec_len,
- fpi_len, 100 * (double) fpi_len / total_fpi_len,
- tot_len, 100 * (double) tot_len / total_len);
+ count, total_count, rec_len, total_rec_len,
+ fpi_len, total_fpi_len, tot_len, total_len);
}
}
}
@@ -601,14 +616,22 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
* them from the earlier ones, and are thus up to 9 characters long.
*/
+ rec_len_pct = 0;
+ if (total_len != 0)
+ rec_len_pct = 100 * (double) total_rec_len / total_len;
+
+ fpi_len_pct = 0;
+ if (total_len != 0)
+ fpi_len_pct = 100 * (double) total_fpi_len / total_len;
+
printf("%-27s "
"%20" INT64_MODIFIER "u %-9s"
"%20" INT64_MODIFIER "u %-9s"
"%20" INT64_MODIFIER "u %-9s"
"%20" INT64_MODIFIER "u %-6s\n",
"Total", stats->count, "",
- total_rec_len, psprintf("[%.02f%%]", 100 * (double)total_rec_len / total_len),
- total_fpi_len, psprintf("[%.02f%%]", 100 * (double)total_fpi_len / total_len),
+ total_rec_len, psprintf("[%.02f%%]", rec_len_pct),
+ total_fpi_len, psprintf("[%.02f%%]", fpi_len_pct),
total_len, "[100%]");
}
--
1.9.1
On 04/17/2015 12:04 PM, Simon Riggs wrote:
On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote:
Hrmpf. Says the person that used a lot of padding, without much
discussion, for the WAL level infrastructure making pg_rewind more
maintainable.Sounds bad. What padding are we talking about?
In the new WAL format, the data chunks are stored unaligned, without
padding, to save space. The new format is quite different to the old
one, so it's not straightforward to compare how much that saved. The
fixed-size XLogRecord header is 8 bytes shorter in the new format,
because it doesn't have the xl_len field anymore. But the same
information is stored elsewhere in the record, where it takes 2 or 5
bytes (XLogRecordDataHeaderShort/Long).
But it's a fair point that we could've just made small adjustments to
the old format, without revamping every record type and the way the
block information is stored, and that the space saving of the new format
should be compared with that instead, for a fair comparison.
As an example, one simple thing we could've done with the old format:
remove xl_len, and store the length in place of the two unused padding
bytes instead, as long as it fits in 16 bits. For longer records, set a
flag and store it right after XLogRecord header. For practically all WAL
records, that would've shrunk XLogRecord from 32 to 24 bytes, and made
each record 8 bytes shorter.
I ran the same pgbench test Andres used, with scale 10, and 50000
transactions, and compared the WAL size between master and 9.4:
master: 20738352
9.4: 23915800
According to pg_xlogdump, there were 301153 WAL records. If you take the
9.4 figure, and imagine that we had saved those 8 bytes on each WAL
record, 9.4 would've been 21506576 bytes instead. So yeah, we could've
achieved much of the WAL savings with that much smaller change. That's a
useful thing to compare with.
BTW, those numbers are with wal_level=minimal. With wal_level=logical,
the WAL size from the same test on master was 26503520 bytes. That's
quite a bump. Looking at pg_xlogdump output, it seems that it's all
because the commit records are wider.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 17 April 2015 at 18:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 04/17/2015 12:04 PM, Simon Riggs wrote:
On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote:
Hrmpf. Says the person that used a lot of padding, without much
discussion, for the WAL level infrastructure making pg_rewind more
maintainable.Sounds bad. What padding are we talking about?
In the new WAL format, the data chunks are stored unaligned, without
padding, to save space. The new format is quite different to the old one,
so it's not straightforward to compare how much that saved.
The key point here is the whole WAL format was changed to accommodate a
minor requirement for one utility. Please notice that nobody tried to stop
you doing that.
The changes Andres is requesting have a very significant effect on a major
new facility. Perhaps there is concern that it is an external utility?
If we can trust Heikki to include code into core that was written
externally then I think we can do the same for Andres.
I think its time to stop the padding discussion and commit something
useful. We need this.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/17/2015 08:36 PM, Simon Riggs wrote:
On 17 April 2015 at 18:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 04/17/2015 12:04 PM, Simon Riggs wrote:
On 17 April 2015 at 09:54, Andres Freund <andres@anarazel.de> wrote:
Hrmpf. Says the person that used a lot of padding, without much
discussion, for the WAL level infrastructure making pg_rewind more
maintainable.Sounds bad. What padding are we talking about?
In the new WAL format, the data chunks are stored unaligned, without
padding, to save space. The new format is quite different to the old one,
so it's not straightforward to compare how much that saved.The key point here is the whole WAL format was changed to accommodate a
minor requirement for one utility. Please notice that nobody tried to stop
you doing that.The changes Andres is requesting have a very significant effect on a major
new facility. Perhaps there is concern that it is an external utility?If we can trust Heikki to include code into core that was written
externally then I think we can do the same for Andres.
I'm not concerned of the fact it is an external utility. Well, it
concerns me a little bit, because that means that it'll get little
testing with PostgreSQL. But that has nothing to do with the WAL size
question.
I think its time to stop the padding discussion and commit something
useful. We need this.
To be honest, I'm not entirely sure what we're arguing over. I said that
IMO the difference in WAL size is so small that we should just use
4-byte OIDs for the replication identifiers, instead of trying to make
do with 2 bytes. Not because I find it too likely that you'll run out of
IDs (although it could happen), but more to make replication IDs more
like all other system objects we have. Andreas did some pgbench
benchmarking to show that the difference in WAL size is about 10%. The
WAL records generated by pgbench happen to have just the right sizes so
that the 2-3 extra bytes bump them over to the next alignment boundary.
That's why there is such a big difference - on average it'll be less. I
think that's acceptable, Andreas seems to think otherwise. But if the
WAL size really is so precious, we could remove the two padding bytes
from XLogRecord, instead of dedicating them for the replication ids.
That would be an even better use for them.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 17 April 2015 at 19:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
To be honest, I'm not entirely sure what we're arguing over.
When arguing over something you consider small, it is customary to allow
the author precedence. We can't do things our own way all the time.
I didn't much like pg_rewind, but it doesn't hurt and you like it, so I
didn't object. We've all got better things to do.
I said that IMO the difference in WAL size is so small that we should just
use 4-byte OIDs for the replication identifiers, instead of trying to make
do with 2 bytes. Not because I find it too likely that you'll run out of
IDs (although it could happen), but more to make replication IDs more like
all other system objects we have. Andreas did some pgbench benchmarking to
show that the difference in WAL size is about 10%. The WAL records
generated by pgbench happen to have just the right sizes so that the 2-3
extra bytes bump them over to the next alignment boundary. That's why there
is such a big difference - on average it'll be less. I think that's
acceptable, Andreas seems to think otherwise. But if the WAL size really is
so precious, we could remove the two padding bytes from XLogRecord, instead
of dedicating them for the replication ids. That would be an even better
use for them.
The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend would
be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
that.
The change does nothing useful, since I doubt anyone will ever need >32768
nodes in their cluster.
Increasing WAL size for any non-zero amount is needlessly wasteful for a
change with only cosmetic value. But for a change that has significant
value for database resilience, it is a sensible use of bytes.
+1 to Andres' very reasonable suggestion. Lets commit this and go home.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 17/04/15 22:36, Simon Riggs wrote:
I said that IMO the difference in WAL size is so small that we
should just use 4-byte OIDs for the replication identifiers, instead
of trying to make do with 2 bytes. Not because I find it too likely
that you'll run out of IDs (although it could happen), but more to
make replication IDs more like all other system objects we have.
Andreas did some pgbench benchmarking to show that the difference in
WAL size is about 10%. The WAL records generated by pgbench happen
to have just the right sizes so that the 2-3 extra bytes bump them
over to the next alignment boundary. That's why there is such a big
difference - on average it'll be less. I think that's acceptable,
Andreas seems to think otherwise. But if the WAL size really is so
precious, we could remove the two padding bytes from XLogRecord,
instead of dedicating them for the replication ids. That would be an
even better use for them.The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend
would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
don't do that.The change does nothing useful, since I doubt anyone will ever need
32768 nodes in their cluster.
And if they did there would be other much bigger problems than
replication identifier being 16bit (it's actually >65534 as it's
unsigned btw).
Considering the importance and widespread use of replication I think we
should really make sure the related features have small overhead.
Increasing WAL size for any non-zero amount is needlessly wasteful for a
change with only cosmetic value. But for a change that has significant
value for database resilience, it is a sensible use of bytes.+1 to Andres' very reasonable suggestion. Lets commit this and go home.
+1
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 04/17/2015 11:36 PM, Simon Riggs wrote:
On 17 April 2015 at 19:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
To be honest, I'm not entirely sure what we're arguing over.
When arguing over something you consider small, it is customary to allow
the author precedence. We can't do things our own way all the time.
Sure, I'm not going to throw a tantrum if Andres commits this as it is.
I said that IMO the difference in WAL size is so small that we should just
use 4-byte OIDs for the replication identifiers, instead of trying to make
do with 2 bytes. Not because I find it too likely that you'll run out of
IDs (although it could happen), but more to make replication IDs more like
all other system objects we have. Andreas did some pgbench benchmarking to
show that the difference in WAL size is about 10%. The WAL records
generated by pgbench happen to have just the right sizes so that the 2-3
extra bytes bump them over to the next alignment boundary. That's why there
is such a big difference - on average it'll be less. I think that's
acceptable, Andreas seems to think otherwise. But if the WAL size really is
so precious, we could remove the two padding bytes from XLogRecord, instead
of dedicating them for the replication ids. That would be an even better
use for them.The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend would
be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
that.
That's a straw man argument. I'm not saying we should never use 2 byte
values anywhere. OID is usually used as the primary key in system
tables. There are exceptions, but that is nevertheless the norm. I'm
saying that saving in WAL size is not worth making an exception here,
and we should go with the simplest option of using OIDs.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 04/17/2015 11:45 PM, Petr Jelinek wrote:
The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend
would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
don't do that.The change does nothing useful, since I doubt anyone will ever need
32768 nodes in their cluster.
And if they did there would be other much bigger problems than
replication identifier being 16bit (it's actually >65534 as it's
unsigned btw).
Can you name some of the bigger problems you'd have?
Obviously, if you have 100000 high-volume OLTP nodes connected to a
single server, feeding transactions as a continous stream, you're going
to choke the system. But you might have 100000 tiny satellite databases
that sync up with the master every few hours, and each of them do only a
few updates per day.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote:
Can you name some of the bigger problems you'd have?
Several parts of the system are O(#max_replication_slots). Having 100k
outgoing logical replication slots is going to be expensive.
Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be
the biggest problem.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 04/17/2015 11:54 AM, Andres Freund wrote:
I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for "external" 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.
Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called
"pg_replication_identifier". But that's like talking about "index
identifiers", instead of just indexes, and calling the system table
pg_index_oid.
The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I
think we should roughly do a search-replace of "replication identifier"
-> "replication origin" to the patch. Or even "transaction origin".
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called "pg_replication_identifier".
But that's like talking about "index identifiers", instead of just indexes,
and calling the system table pg_index_oid.The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".
Sounds good to me.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 20 April 2015 at 09:28, Andres Freund <andres@anarazel.de> wrote:
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called"pg_replication_identifier".
But that's like talking about "index identifiers", instead of just
indexes,
and calling the system table pg_index_oid.
The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. Ithink
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".Sounds good to me.
+1
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-04-20 10:28:02 +0200, Andres Freund wrote:
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called "pg_replication_identifier".
But that's like talking about "index identifiers", instead of just indexes,
and calling the system table pg_index_oid.The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".Sounds good to me.
I'm working on changing this (I've implemented the missing WAL
bits). I'd like to discuss the new terms for a sec, before I go and
revise the docs.
I'm now calling the feature 'replication progress tracking'. There's
"replication origins" and there's progress tracking infrastructure that
tracks how far data from a "replication origin" has replicated.
Catalog wise there's an actual table 'pg_replication_origin' that maps
between 'roident' and 'roname'. There's a pg_replication_progress view
(used to be named pg_replication_identifier_progress). I'm not sure if
the latter name isn't too generic? Maybe
pg_logical_replication_progress?
I've now named the functions:
* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : "manually" set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.
Any comments?
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund wrote:
I'm working on changing this (I've implemented the missing WAL
bits). I'd like to discuss the new terms for a sec, before I go and
revise the docs.I'm now calling the feature 'replication progress tracking'. There's
"replication origins" and there's progress tracking infrastructure that
tracks how far data from a "replication origin" has replicated.
Sounds good to me.
Catalog wise there's an actual table 'pg_replication_origin' that maps
between 'roident' and 'roname'. There's a pg_replication_progress view
(used to be named pg_replication_identifier_progress). I'm not sure if
the latter name isn't too generic? Maybe
pg_logical_replication_progress?
I think if we wanted "pg_logical_replication_progress" (and I don't
really agree that we do) then we would add the "logical" bit to the
names above as well. This seems unnecessary. pg_replication_progress
seems okay to me.
I've now named the functions:
* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : "manually" set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
These all look acceptable to me.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.
This combination seems confusing. In some other thread not too long ago
there was the argument that "all functions 'get' something, so that verb
should not appear in the function name". That would call for
"pg_replication_progress" on the singleton. Maybe to distinguish the
SRF, add "all" as a suffix?
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
Not sure about the "tx" here. We use "xact" as an abbreviation for
"transaction" in most places. If nowadays we don't like that term,
maybe just spell out "transaction" in full. I assume this function
pairs up with pg_replication_progress_setup_origin, yes?
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-21 12:20:42 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
Catalog wise there's an actual table 'pg_replication_origin' that maps
between 'roident' and 'roname'. There's a pg_replication_progress view
(used to be named pg_replication_identifier_progress). I'm not sure if
the latter name isn't too generic? Maybe
pg_logical_replication_progress?I think if we wanted "pg_logical_replication_progress" (and I don't
really agree that we do) then we would add the "logical" bit to the
names above as well. This seems unnecessary. pg_replication_progress
seems okay to me.
Cool.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.This combination seems confusing. In some other thread not too long ago
there was the argument that "all functions 'get' something, so that verb
should not appear in the function name".
That would call for "pg_replication_progress" on the singleton.
Hm. I don't like that. That'd e.g. clash with the above view. I think
it's good to distinguish between functions (that have a verb in the
name) and views/tables (that don't).
I agree that the above combination isn't optimal. Although pg_get (and
pg_stat_get) is what's used for a lot of other SRF backed views. Maybe
naming the SRF pg_get_all_replication_progress?
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)Not sure about the "tx" here. We use "xact" as an abbreviation for
"transaction" in most places.
Oh, yea. Xact is more consistent.
If nowadays we don't like that term, maybe just spell out
"transaction" in full. I assume this function pairs up with
pg_replication_progress_setup_origin, yes?
pg_replication_progress_setup_origin sets up the per session state,
setup_xact_details the "per replayed transaction" state.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote:
I've now named the functions:
* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : "manually" set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.Any comments?
Why are we using functions for this rather than DDL?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote:
I've now named the functions:
* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : "manually" set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.Any comments?
Why are we using functions for this rather than DDL?
Unless I miss something the only two we really could use ddl for is
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 21/04/15 22:36, Andres Freund wrote:
On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote:
I've now named the functions:
* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : "manually" set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.Any comments?
Why are we using functions for this rather than DDL?
Unless I miss something the only two we really could use ddl for is
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.
I think the only value of having DDL for this would be consistency
(catalog objects are created via DDL) as it looks like something that
will be called only by extensions and not users during normal operation.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-03-24 22:22:29 +0100, Petr Jelinek wrote:
Perhaps we should have some Logical replication developer documentation
section and put all those three as subsections of that?
So I just played around with this and it didn't find it
worthwhile. Primarily because there's lots of uses of logical decoding
besides building a logical replication solution. I've reverted to
putting it into a separate chapter 'besides' logical decoding.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
On 04/17/2015 11:54 AM, Andres Freund wrote:
I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for "external" 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called "pg_replication_identifier".
But that's like talking about "index identifiers", instead of just indexes,
and calling the system table pg_index_oid.The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".
Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.
I'm pretty happy with this state.
Greetings,
Andres Freund
Attachments:
0001-Introduce-replication-progress-tracking-infrastructu.patchtext/x-patch; charset=us-asciiDownload
>From fc406a87f2e2ac08a7ac112ef6b75be1e8256a16 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 24 Apr 2015 14:25:56 +0200
Subject: [PATCH] Introduce replication progress tracking infrastructure.
(v2.0)
When implementing a replication solution ontop of logical decoding two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented in this commit, consist
out of three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all former capabilities, except
that there's only 2^16 different origins; but now they integrate with
logical decoding. Additionally more functionality is accessible via SQL.
Since the commit timestamp infrastructure has also been introduced in
9.5 that's not a problem.
For now the number of origins for which the replication progress can be
tracked is determined by the max_replication_slots GUC. That GUC is not
a perfect match to configure this, but there doesn't seem to be
sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
---
contrib/test_decoding/Makefile | 3 +-
contrib/test_decoding/expected/replorigin.out | 141 +++
contrib/test_decoding/sql/replorigin.sql | 64 +
contrib/test_decoding/test_decoding.c | 28 +
doc/src/sgml/catalogs.sgml | 123 ++
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/func.sgml | 201 ++-
doc/src/sgml/logicaldecoding.sgml | 35 +-
doc/src/sgml/postgres.sgml | 1 +
doc/src/sgml/replication-origins.sgml | 93 ++
src/backend/access/heap/heapam.c | 19 +
src/backend/access/rmgrdesc/Makefile | 4 +-
src/backend/access/rmgrdesc/replorigindesc.c | 61 +
src/backend/access/rmgrdesc/xactdesc.c | 24 +-
src/backend/access/transam/commit_ts.c | 53 +-
src/backend/access/transam/rmgr.c | 1 +
src/backend/access/transam/xact.c | 72 +-
src/backend/access/transam/xlog.c | 8 +
src/backend/access/transam/xloginsert.c | 27 +-
src/backend/access/transam/xlogreader.c | 6 +
src/backend/catalog/Makefile | 2 +-
src/backend/catalog/catalog.c | 8 +-
src/backend/catalog/system_views.sql | 7 +
src/backend/replication/logical/Makefile | 3 +-
src/backend/replication/logical/decode.c | 49 +-
src/backend/replication/logical/logical.c | 29 +
src/backend/replication/logical/origin.c | 1479 +++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 5 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/cache/syscache.c | 23 +
src/bin/pg_resetxlog/pg_resetxlog.c | 3 +
src/include/access/commit_ts.h | 14 +-
src/include/access/rmgrlist.h | 1 +
src/include/access/xact.h | 11 +
src/include/access/xlog.h | 1 +
src/include/access/xlog_internal.h | 2 +-
src/include/access/xlogdefs.h | 6 +
src/include/access/xloginsert.h | 1 +
src/include/access/xlogreader.h | 3 +
src/include/access/xlogrecord.h | 1 +
src/include/catalog/catversion.h | 2 +-
src/include/catalog/indexing.h | 6 +
src/include/catalog/pg_proc.h | 36 +
src/include/catalog/pg_replication_origin.h | 69 ++
src/include/replication/logical.h | 2 +
src/include/replication/origin.h | 86 ++
src/include/replication/output_plugin.h | 8 +
src/include/replication/reorderbuffer.h | 8 +-
src/include/storage/lwlock.h | 3 +-
src/include/utils/syscache.h | 2 +
src/test/regress/expected/rules.out | 5 +
src/test/regress/expected/sanity_check.out | 1 +
52 files changed, 2755 insertions(+), 89 deletions(-)
create mode 100644 contrib/test_decoding/expected/replorigin.out
create mode 100644 contrib/test_decoding/sql/replorigin.sql
create mode 100644 doc/src/sgml/replication-origins.sgml
create mode 100644 src/backend/access/rmgrdesc/replorigindesc.c
create mode 100644 src/backend/replication/logical/origin.c
create mode 100644 src/include/catalog/pg_replication_origin.h
create mode 100644 src/include/replication/origin.h
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 613e9c3..656eabf 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -37,7 +37,8 @@ submake-isolation:
submake-test_decoding:
$(MAKE) -C $(top_builddir)/contrib/test_decoding
-REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel binary prepared
+REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \
+ binary prepared replorigin
regresscheck: all | submake-regress submake-test_decoding temp-install
$(MKDIR_P) regression_output
diff --git a/contrib/test_decoding/expected/replorigin.out b/contrib/test_decoding/expected/replorigin.out
new file mode 100644
index 0000000..c0f5125
--- /dev/null
+++ b/contrib/test_decoding/expected/replorigin.out
@@ -0,0 +1,141 @@
+-- predictability
+SET synchronous_commit = on;
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+SELECT pg_replication_origin_create('test_decoding: regression_slot');
+ pg_replication_origin_create
+------------------------------
+ 1
+(1 row)
+
+-- ensure duplicate creations fail
+SELECT pg_replication_origin_create('test_decoding: regression_slot');
+ERROR: duplicate key value violates unique constraint "pg_replication_origin_roname_index"
+DETAIL: Key (roname)=(test_decoding: regression_slot) already exists.
+--ensure deletions work (once)
+SELECT pg_replication_origin_create('test_decoding: temp');
+ pg_replication_origin_create
+------------------------------
+ 2
+(1 row)
+
+SELECT pg_replication_origin_drop('test_decoding: temp');
+ pg_replication_origin_drop
+----------------------------
+
+(1 row)
+
+SELECT pg_replication_origin_drop('test_decoding: temp');
+ERROR: cache lookup failed for replication origin 'test_decoding: temp'
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ BEGIN
+ table public.target_tbl: INSERT: id[integer]:1 data[text]:'BEGIN'
+ table public.target_tbl: INSERT: id[integer]:2 data[text]:'table public.origin_tbl: INSERT: id[integer]:1 data[text]:''will be replicated and decoded and decoded again'''
+ table public.target_tbl: INSERT: id[integer]:3 data[text]:'COMMIT'
+ COMMIT
+(5 rows)
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+-- mark session as replaying
+SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
+ pg_replication_origin_session_setup
+-------------------------------------
+
+(1 row)
+
+-- ensure we prevent duplicate setup
+SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
+ERROR: cannot setup replication origin when one is already setup
+BEGIN;
+-- setup transaction origin
+SELECT pg_replication_origin_xact_setup('0/aabbccdd', '2013-01-01 00:00');
+ pg_replication_origin_xact_setup
+----------------------------------
+
+(1 row)
+
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+-- check replication progress for the session is correct
+SELECT pg_replication_origin_session_progress(false);
+ pg_replication_origin_session_progress
+----------------------------------------
+ 0/AABBCCDD
+(1 row)
+
+SELECT pg_replication_origin_session_progress(true);
+ pg_replication_origin_session_progress
+----------------------------------------
+ 0/AABBCCDD
+(1 row)
+
+SELECT pg_replication_origin_session_reset();
+ pg_replication_origin_session_reset
+-------------------------------------
+
+(1 row)
+
+SELECT local_id, external_id, remote_lsn, local_lsn <> '0/0' FROM pg_replication_origin_status;
+ local_id | external_id | remote_lsn | ?column?
+----------+--------------------------------+------------+----------
+ 1 | test_decoding: regression_slot | 0/AABBCCDD | t
+(1 row)
+
+-- check replication progress identified by name is correct
+SELECT pg_replication_origin_progress('test_decoding: regression_slot', false);
+ pg_replication_origin_progress
+--------------------------------
+ 0/AABBCCDD
+(1 row)
+
+SELECT pg_replication_origin_progress('test_decoding: regression_slot', true);
+ pg_replication_origin_progress
+--------------------------------
+ 0/AABBCCDD
+(1 row)
+
+-- ensure reset requires previously setup state
+SELECT pg_replication_origin_session_reset();
+ERROR: no replication origin is configured
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+------
+(0 rows)
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+ data
+--------------------------------------------------------------------------------
+ BEGIN
+ table public.origin_tbl: INSERT: id[integer]:3 data[text]:'will be replicated'
+ COMMIT
+(3 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_replication_origin_drop('test_decoding: regression_slot');
+ pg_replication_origin_drop
+----------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/replorigin.sql b/contrib/test_decoding/sql/replorigin.sql
new file mode 100644
index 0000000..e12404e
--- /dev/null
+++ b/contrib/test_decoding/sql/replorigin.sql
@@ -0,0 +1,64 @@
+-- predictability
+SET synchronous_commit = on;
+
+CREATE TABLE origin_tbl(id serial primary key, data text);
+CREATE TABLE target_tbl(id serial primary key, data text);
+
+SELECT pg_replication_origin_create('test_decoding: regression_slot');
+-- ensure duplicate creations fail
+SELECT pg_replication_origin_create('test_decoding: regression_slot');
+
+--ensure deletions work (once)
+SELECT pg_replication_origin_create('test_decoding: temp');
+SELECT pg_replication_origin_drop('test_decoding: temp');
+SELECT pg_replication_origin_drop('test_decoding: temp');
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+-- origin tx
+INSERT INTO origin_tbl(data) VALUES ('will be replicated and decoded and decoded again');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- as is normal, the insert into target_tbl shows up
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+INSERT INTO origin_tbl(data) VALUES ('will be replicated, but not decoded again');
+
+-- mark session as replaying
+SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
+
+-- ensure we prevent duplicate setup
+SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
+
+BEGIN;
+-- setup transaction origin
+SELECT pg_replication_origin_xact_setup('0/aabbccdd', '2013-01-01 00:00');
+INSERT INTO target_tbl(data)
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+COMMIT;
+
+-- check replication progress for the session is correct
+SELECT pg_replication_origin_session_progress(false);
+SELECT pg_replication_origin_session_progress(true);
+
+SELECT pg_replication_origin_session_reset();
+
+SELECT local_id, external_id, remote_lsn, local_lsn <> '0/0' FROM pg_replication_origin_status;
+
+-- check replication progress identified by name is correct
+SELECT pg_replication_origin_progress('test_decoding: regression_slot', false);
+SELECT pg_replication_origin_progress('test_decoding: regression_slot', true);
+
+-- ensure reset requires previously setup state
+SELECT pg_replication_origin_session_reset();
+
+-- and magically the replayed xact will be filtered!
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+--but new original changes still show up
+INSERT INTO origin_tbl(data) VALUES ('will be replicated');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'only-local', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_replication_origin_drop('test_decoding: regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 963d5df..bca03ee 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -21,6 +21,7 @@
#include "replication/output_plugin.h"
#include "replication/logical.h"
+#include "replication/origin.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -43,6 +44,7 @@ typedef struct
bool include_timestamp;
bool skip_empty_xacts;
bool xact_wrote_changes;
+ bool only_local;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +61,8 @@ static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
+static bool pg_decode_filter(LogicalDecodingContext *ctx,
+ RepOriginId origin_id);
void
_PG_init(void)
@@ -76,6 +80,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
}
@@ -97,6 +102,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_xids = true;
data->include_timestamp = false;
data->skip_empty_xacts = false;
+ data->only_local = false;
ctx->output_plugin_private = data;
@@ -155,6 +161,17 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "only-local") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->only_local = true;
+ else if (!parse_bool(strVal(elem->arg), &data->only_local))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -223,6 +240,17 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+static bool
+pg_decode_filter(LogicalDecodingContext *ctx,
+ RepOriginId origin_id)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->only_local && origin_id != InvalidRepOriginId)
+ return true;
+ return false;
+}
+
/*
* Print literal `outputstr' already represented as string of type `typid'
* into stringbuf `s'.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 4e6fd0e..742658c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -239,6 +239,16 @@
</row>
<row>
+ <entry><link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link></entry>
+ <entry>registered replication origins</entry>
+ </row>
+
+ <row>
+ <entry><link linkend="catalog-pg-replication-origin-status"><structname>pg_replication_origin_status</structname></link></entry>
+ <entry>information about replication origins, including replication progress</entry>
+ </row>
+
+ <row>
<entry><link linkend="catalog-pg-replication-slots"><structname>pg_replication_slots</structname></link></entry>
<entry>replication slot information</entry>
</row>
@@ -5323,6 +5333,119 @@
</sect1>
+ <sect1 id="catalog-pg-replication-origin">
+ <title><structname>pg_replication_origin</structname></title>
+
+ <indexterm zone="catalog-pg-replication-origin">
+ <primary>pg_replication_origin</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_replication_origin</structname> catalog contains
+ all replication origins created. For more on replication origins
+ see <xref linkend="replication-origins">.
+ </para>
+
+ <table>
+
+ <title><structname>pg_replication_origin</structname> Columns</title>
+
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Type</entry>
+ <entry>References</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>roident</structfield></entry>
+ <entry><type>Oid</type></entry>
+ <entry></entry>
+ <entry>A unique, cluster-wide identifier for the replication
+ origin. Should never leave the system.</entry>
+ </row>
+
+ <row>
+ <entry><structfield>roname</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry></entry>
+ <entry>The external, user defined, name of a replication
+ origin.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="catalog-pg-replication-origin-status">
+ <title><structname>pg_replication_origin_status</structname></title>
+
+ <indexterm zone="catalog-pg-replication-origin-status">
+ <primary>pg_replication_origin_status</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_replication_origin_status</structname> view
+ contains information about how far replay for a certain origin has
+ progressed. For more on replication origins
+ see <xref linkend="replication-origins">.
+ </para>
+
+ <table>
+
+ <title><structname>pg_replication_origin_status</structname> Columns</title>
+
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Type</entry>
+ <entry>References</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>local_id</structfield></entry>
+ <entry><type>Oid</type></entry>
+ <entry><literal><link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link>.roident</literal></entry>
+ <entry>internal node identifier</entry>
+ </row>
+
+ <row>
+ <entry><structfield>external_id</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry><literal><link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link>.roname</literal></entry>
+ <entry>external node identifier</entry>
+ </row>
+
+ <row>
+ <entry><structfield>remote_lsn</structfield></entry>
+ <entry><type>pg_lsn</type></entry>
+ <entry></entry>
+ <entry>The origin node's LSN up to which data has been replicated.</entry>
+ </row>
+
+
+ <row>
+ <entry><structfield>local_lsn</structfield></entry>
+ <entry><type>pg_lsn</type></entry>
+ <entry></entry>
+ <entry>This node's LSN that at
+ which <literal>remote_lsn</literal> has been replicated. Used to
+ flush commit records before persisting data to disk when using
+ asynchronous commits.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
<sect1 id="catalog-pg-replication-slots">
<title><structname>pg_replication_slots</structname></title>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 26aa7ee..6268d54 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -95,6 +95,7 @@
<!ENTITY fdwhandler SYSTEM "fdwhandler.sgml">
<!ENTITY custom-scan SYSTEM "custom-scan.sgml">
<!ENTITY logicaldecoding SYSTEM "logicaldecoding.sgml">
+<!ENTITY replication-origins SYSTEM "replication-origins.sgml">
<!ENTITY protocol SYSTEM "protocol.sgml">
<!ENTITY sources SYSTEM "sources.sgml">
<!ENTITY storage SYSTEM "storage.sgml">
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5f7bf6a..c53f80c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16874,11 +16874,13 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<title>Replication Functions</title>
<para>
- The functions shown in <xref linkend="functions-replication-table"> are
- for controlling and interacting with replication features.
- See <xref linkend="streaming-replication">
- and <xref linkend="streaming-replication-slots"> for information about the
- underlying features. Use of these functions is restricted to superusers.
+ The functions shown
+ in <xref linkend="functions-replication-table"> are for
+ controlling and interacting with replication features.
+ See <xref linkend="streaming-replication">,
+ <xref linkend="streaming-replication-slots">, <xref linkend="replication-origins">
+ for information about the underlying features. Use of these
+ functions is restricted to superusers.
</para>
<para>
@@ -17035,6 +17037,195 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
on future calls.
</entry>
</row>
+
+ <row id="pg-replication-origin-create">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_create</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_create(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ <parameter>internal_id</parameter> <type>oid</type>
+ </entry>
+ <entry>
+ Create a replication origin with the the passed in external
+ name, and create an internal id for it.
+ </entry>
+ </row>
+
+ <row id="pg-replication-origin-drop">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_drop</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_drop(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Delete a previously created replication origin, including the
+ associated replay progress.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_oid</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_oid(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ <parameter>internal_id</parameter> <type>oid</type>
+ </entry>
+ <entry>
+ Lookup replication origin by name and return the internal
+ oid. If no corresponding replication origin is found a error
+ is thrown.
+ </entry>
+ </row>
+
+ <row id="pg-replication-origin-session-setup">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_session_setup</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_setup_session(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Configure the current session to be replaying from the passed in
+ origin, allowing replay progress to be tracked. Use
+ <function>pg_replication_origin_session_reset</function> to revert.
+ Can only be used if no previous origin is configured.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_session_reset</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_session_reset()</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Cancel the effects
+ of <function>pg_replication_origin_session_setup()</function>.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_session_is_setup</primary>
+ </indexterm>
+ <literal><function>pg_replication_session_is_setup()</function></literal>
+ </entry>
+ <entry>
+ bool
+ </entry>
+ <entry>
+ Has a replication origin been configured in the current session?
+ </entry>
+ </row>
+
+ <row id="pg-replication-origin-session-progress">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_session_progress</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_progress(<parameter>flush</parameter> <type>bool</type>)</function></literal>
+ </entry>
+ <entry>
+ pg_lsn
+ </entry>
+ <entry>
+ Return the replay position for the replication origin configured in
+ the current session. The parameter <parameter>flush</parameter>
+ determines whether the corresponding local transaction will be
+ guaranteed to have been flushed to disk or not.
+ </entry>
+ </row>
+
+ <row id="pg-replication-origin-xact-setup">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_xact_setup</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_xact_setup(<parameter>origin_lsn</parameter> <type>pg_lsn</type>, <parameter>origin_timestamp</parameter> <type>timestamptz</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Mark the current transaction to be replaying a transaction that has
+ committed at the passed in <acronym>LSN</acronym> and timestamp. Can
+ only be called when a replication origin has previously been
+ configured using
+ <function>pg_replication_origin_session_setup()</function>.
+ </entry>
+ </row>
+
+ <row id="pg-replication-origin-xact-reset">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_xact_reset</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_xact_reset()</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Cancel the effects of
+ <function>pg_replication_origin_xact_setup()</function>.
+ </entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_advance</primary>
+ </indexterm>
+ <literal>pg_replication_origin_advance<function>(<parameter>node_name</parameter> <type>text</type>, <parameter>pos</parameter> <type>pg_lsn</type>)</function></literal>
+ </entry>
+ <entry>
+ void
+ </entry>
+ <entry>
+ Set replication progress for the passed in node to the passed in
+ position. This primarily is useful for setting up the initial position
+ or a new position after configuration changes and similar. Be aware
+ that careless use of this function can lead to inconsistently
+ replicated data.
+ </entry>
+ </row>
+
+ <row id="pg-replication-origin-progress">
+ <entry>
+ <indexterm>
+ <primary>pg_replication_origin_progress</primary>
+ </indexterm>
+ <literal><function>pg_replication_origin_progress(<parameter>node_name</parameter> <type>text</type>, <parameter>flush</parameter> <type>bool</type>)</function></literal>
+ </entry>
+ <entry>
+ pg_lsn
+ </entry>
+ <entry>
+ Return the replay position for the passed in replication
+ identifier. The parameter <parameter>flush</parameter> determines
+ whether the corresponding local transaction will be guaranteed to have
+ been flushed to disk or not.
+ </entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 0810a2d..f817af3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -363,6 +363,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -370,7 +371,8 @@ typedef void (*LogicalOutputPluginInit)(struct OutputPluginCallbacks *cb);
</programlisting>
The <function>begin_cb</function>, <function>change_cb</function>
and <function>commit_cb</function> callbacks are required,
- while <function>startup_cb</function>
+ while <function>startup_cb</function>,
+ <function>filter_by_origin_cb</function>
and <function>shutdown_cb</function> are optional.
</para>
</sect2>
@@ -569,6 +571,37 @@ typedef void (*LogicalDecodeChangeCB) (
</para>
</note>
</sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-by-origin">
+ <title>Origin Filter Callback</title>
+
+ <para>
+ The optional <function>filter_by_origin_cb</function> callback
+ is called to determine wheter data that has been replayed
+ from <parameter>origin_id</parameter> is of interest to the
+ output plugin.
+<programlisting>
+typedef bool (*LogicalDecodeChangeCB) (
+ struct LogicalDecodingContext *ctx,
+ RepNodeId origin_id
+);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. No information but the origin is
+ available. To signal that changes originating on the passed in
+ node are irrelevant, return true, causing them to be filtered
+ away; false otherwise. The other callbacks will not be called
+ for transactions and changes that have been filtered away.
+ </para>
+ <para>
+ This is useful when implementing cascading or multi directional
+ replication solutions. Filtering by the origin allows to
+ prevent replicating the same changes back and forth in such
+ setups. While transactions and changes also carry information
+ about the origin, filtering via this callback is noticeably
+ more efficient.
+ </para>
+ </sect3>
</sect2>
<sect2 id="logicaldecoding-output-plugin-output">
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index e378d69..4a45138 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -220,6 +220,7 @@
&spi;
&bgworker;
&logicaldecoding;
+ &replication-origins;
</part>
diff --git a/doc/src/sgml/replication-origins.sgml b/doc/src/sgml/replication-origins.sgml
new file mode 100644
index 0000000..c531022
--- /dev/null
+++ b/doc/src/sgml/replication-origins.sgml
@@ -0,0 +1,93 @@
+<!-- doc/src/sgml/replication-origins.sgml -->
+<chapter id="replication-origins">
+ <title>Replication Progress Tracking</title>
+ <indexterm zone="replication-origins">
+ <primary>Replication Progress Tracking</primary>
+ </indexterm>
+ <indexterm zone="replication-origins">
+ <primary>Replication Origins</primary>
+ </indexterm>
+
+ <para>
+ Replication origins are intended to make it easier to implement
+ logical replication solutions on top
+ of <xref linkend="logicaldecoding">. They provide a solution to two
+ common problems:
+ <itemizedlist>
+ <listitem><para>How to safely keep track of replication progress</para></listitem>
+ <listitem><para>How to change replication behavior, based on the
+ origin of a row; e.g. to avoid loops in bi-directional replication
+ setups</para></listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ Replication origins consist out of a name and a oid. The name, which
+ is what should be used to refer to the origin across systems, is
+ free-form text. It should be used in a way that makes conflicts
+ between replication origins created by different replication
+ solutions unlikely; e.g. by prefixing the replication solution's
+ name to it. The oid is used only to avoid having to store the long
+ version in situations where space efficiency is important. It should
+ never be shared between systems.
+ </para>
+
+ <para>
+ Replication origins can be created using the
+ <link linkend="pg-replication-origin-create"><function>pg_replication_origin_create()</function></link>;
+ dropped using
+ <link linkend="pg-replication-origin-drop"><function>pg_replication_origin_drop()</function></link>;
+ and seen in the
+ <link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link>
+ catalog.
+ </para>
+
+ <para>
+ When replicating from one system to another (independent of the fact that
+ those two might be in the same cluster, or even same database) one
+ nontrivial part of building a replication solution is to keep track of
+ replay progress in a safe manner. When the applying process, or the whole
+ cluster, dies, it needs to be possible to find out up to where data has
+ successfully been replicated. Naive solutions to this like updating a row in
+ a table for every replayed transaction have problems like runtime overhead
+ bloat.
+ </para>
+
+ <para>
+ Using the replication origin infrastructure a session can be
+ marked as replaying from a remote node (using the
+ <link linkend="pg-replication-origin-session-setup"><function>pg_replication_origin_session_setup()</function></link>
+ function. Additionally the <acronym>LSN</acronym> and commit
+ timestamp of every source transaction can be configured on a per
+ transaction basis using
+ <link linkend="pg-replication-origin-xact-setup"><function>pg_replication_origin_xact-setup()</function></link>.
+ If that's done replication progress will be persist in a crash safe
+ manner. Replay progress for all replication origins can be seen in the
+ <link linkend="catalog-pg-replication-origin-status">
+ <structname>pg_replication_origin_status</structname>
+ </link> view. A individual origin's progress, e.g. when resuming
+ replication, can be acquired using
+ <link linkend="pg-replication-origin-progress"><function>pg_replication_origin_progress()</function></link>
+ for any origin or
+ <link linkend="pg-replication-origin-session-progress"><function>pg_replication_origin_session_progress()</function></link>
+ for the origin configured in the current session.
+ </para>
+
+ <para>
+ In more complex replication topologies than replication from exactly one
+ system to one other, another problem can be that, that it is hard to avoid
+ replicating replayed rows again. That can lead both to cycles in the
+ replication and inefficiencies. Replication origins provide a optional
+ mechanism to recognize and prevent that. When configured using the functions
+ referenced in the previous paragraph, every change and transaction passed to
+ output plugin callbacks (see <xref linkend="logicaldecoding-output-plugin">)
+ generated by the session is tagged with the replication origin of the
+ generating session. This allows to treat them differently in the output
+ plugin, e.g. ignoring all but locally originating rows. Additionally
+ the <link linkend="logicaldecoding-output-plugin-filter-by-origin">
+ <function>filter_by_origin_cb</function></link> callback can be used
+ to filter the logical decoding change stream based on the
+ source. While less flexible, filtering via that callback is
+ considerably more efficient.
+ </para>
+</chapter>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 457cd70..b504ccd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2189,6 +2189,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
(char *) heaptup->t_data + SizeofHeapTupleHeader,
heaptup->t_len - SizeofHeapTupleHeader);
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
@@ -2499,6 +2502,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
XLogRegisterBufData(0, tupledata, totaldatalen);
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP2_ID, info);
PageSetLSN(page, recptr);
@@ -2920,6 +2927,9 @@ l1:
- SizeofHeapTupleHeader);
}
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
PageSetLSN(page, recptr);
@@ -4650,6 +4660,8 @@ failed:
tuple->t_data->t_infomask2);
XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ /* we don't decode row locks atm, so no need to log the origin */
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
PageSetLSN(page, recptr);
@@ -5429,6 +5441,8 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
+ /* inplace updates aren't decoded atm, don't log the origin */
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
PageSetLSN(page, recptr);
@@ -6787,6 +6801,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
old_key_tuple->t_len - SizeofHeapTupleHeader);
}
+ /* filtering by origin on a row level is much more efficient */
+ XLogIncludeOrigin();
+
recptr = XLogInsert(RM_HEAP_ID, info);
return recptr;
@@ -6860,6 +6877,8 @@ log_heap_new_cid(Relation relation, HeapTuple tup)
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapNewCid);
+ /* will be looked at irrespective of origin */
+
recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_NEW_CID);
return recptr;
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index d18e8ec..c72a1f2 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -9,8 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
- hashdesc.o heapdesc.o \
- mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
+ hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
+ replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c
new file mode 100644
index 0000000..19bae9a
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replorigindesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * replorigindesc.c
+ * rmgr descriptor routines for replication/logical/replication_origin.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/replorigindesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/origin.h"
+
+void
+replorigin_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case XLOG_REPLORIGIN_SET:
+ {
+ xl_replorigin_set *xlrec;
+ xlrec = (xl_replorigin_set *) rec;
+
+ appendStringInfo(buf, "set %u; lsn %X/%X; force: %d",
+ xlrec->node_id,
+ (uint32) (xlrec->remote_lsn >> 32),
+ (uint32) xlrec->remote_lsn,
+ xlrec->force);
+ break;
+ }
+ case XLOG_REPLORIGIN_DROP:
+ {
+ xl_replorigin_drop *xlrec;
+ xlrec = (xl_replorigin_drop *) rec;
+
+ appendStringInfo(buf, "drop %u", xlrec->node_id);
+ break;
+ }
+ }
+}
+
+const char *
+replorigin_identify(uint8 info)
+{
+ switch (info)
+ {
+ case XLOG_REPLORIGIN_SET:
+ return "SET";
+ case XLOG_REPLORIGIN_DROP:
+ return "DROP";
+ default:
+ return NULL;
+ }
+}
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index b036b6d..3297e1d 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -101,6 +101,16 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
data += sizeof(xl_xact_twophase);
}
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin *xl_origin = (xl_xact_origin *) data;
+
+ parsed->origin_lsn = xl_origin->origin_lsn;
+ parsed->origin_timestamp = xl_origin->origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
+ }
}
void
@@ -156,7 +166,7 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
}
static void
-xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec)
+xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepOriginId origin_id)
{
xl_xact_parsed_commit parsed;
int i;
@@ -218,6 +228,15 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec)
if (XactCompletionForceSyncCommit(parsed.xinfo))
appendStringInfo(buf, "; sync");
+
+ if (parsed.xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ appendStringInfo(buf, "; origin: node %u, lsn %X/%X, at %s",
+ origin_id,
+ (uint32)(parsed.origin_lsn >> 32),
+ (uint32)parsed.origin_lsn,
+ timestamptz_to_str(parsed.origin_timestamp));
+ }
}
static void
@@ -274,7 +293,8 @@ xact_desc(StringInfo buf, XLogReaderState *record)
{
xl_xact_commit *xlrec = (xl_xact_commit *) rec;
- xact_desc_commit(buf, XLogRecGetInfo(record), xlrec);
+ xact_desc_commit(buf, XLogRecGetInfo(record), xlrec,
+ XLogRecGetOrigin(record));
}
else if (info == XLOG_XACT_ABORT || info == XLOG_XACT_ABORT_PREPARED)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index dc23ab2..40042a5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -49,18 +49,18 @@
*/
/*
- * We need 8+4 bytes per xact. Note that enlarging this struct might mean
+ * We need 8+2 bytes per xact. Note that enlarging this struct might mean
* the largest possible file name is more than 5 chars long; see
* SlruScanDirectory.
*/
typedef struct CommitTimestampEntry
{
TimestampTz time;
- CommitTsNodeId nodeid;
+ RepOriginId nodeid;
} CommitTimestampEntry;
#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
- sizeof(CommitTsNodeId))
+ sizeof(RepOriginId))
#define COMMIT_TS_XACTS_PER_PAGE \
(BLCKSZ / SizeOfCommitTimestampEntry)
@@ -93,43 +93,18 @@ CommitTimestampShared *commitTsShared;
/* GUC variable */
bool track_commit_timestamp;
-static CommitTsNodeId default_node_id = InvalidCommitTsNodeId;
-
static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz ts,
- CommitTsNodeId nodeid, int pageno);
+ RepOriginId nodeid, int pageno);
static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
- CommitTsNodeId nodeid, int slotno);
+ RepOriginId nodeid, int slotno);
static int ZeroCommitTsPage(int pageno, bool writeXlog);
static bool CommitTsPagePrecedes(int page1, int page2);
static void WriteZeroPageXlogRec(int pageno);
static void WriteTruncateXlogRec(int pageno);
static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid);
-
-
-/*
- * CommitTsSetDefaultNodeId
- *
- * Set default nodeid for current backend.
- */
-void
-CommitTsSetDefaultNodeId(CommitTsNodeId nodeid)
-{
- default_node_id = nodeid;
-}
-
-/*
- * CommitTsGetDefaultNodeId
- *
- * Set default nodeid for current backend.
- */
-CommitTsNodeId
-CommitTsGetDefaultNodeId(void)
-{
- return default_node_id;
-}
+ RepOriginId nodeid);
/*
* TransactionTreeSetCommitTsData
@@ -156,7 +131,7 @@ CommitTsGetDefaultNodeId(void)
void
TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid, bool do_xlog)
+ RepOriginId nodeid, bool do_xlog)
{
int i;
TransactionId headxid;
@@ -234,7 +209,7 @@ TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
static void
SetXidCommitTsInPage(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz ts,
- CommitTsNodeId nodeid, int pageno)
+ RepOriginId nodeid, int pageno)
{
int slotno;
int i;
@@ -259,7 +234,7 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
- CommitTsNodeId nodeid, int slotno)
+ RepOriginId nodeid, int slotno)
{
int entryno = TransactionIdToCTsEntry(xid);
CommitTimestampEntry entry;
@@ -282,7 +257,7 @@ TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
*/
bool
TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
- CommitTsNodeId *nodeid)
+ RepOriginId *nodeid)
{
int pageno = TransactionIdToCTsPage(xid);
int entryno = TransactionIdToCTsEntry(xid);
@@ -322,7 +297,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (ts)
*ts = 0;
if (nodeid)
- *nodeid = InvalidCommitTsNodeId;
+ *nodeid = InvalidRepOriginId;
return false;
}
@@ -373,7 +348,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
* as NULL if not wanted.
*/
TransactionId
-GetLatestCommitTsData(TimestampTz *ts, CommitTsNodeId *nodeid)
+GetLatestCommitTsData(TimestampTz *ts, RepOriginId *nodeid)
{
TransactionId xid;
@@ -503,7 +478,7 @@ CommitTsShmemInit(void)
commitTsShared->xidLastCommit = InvalidTransactionId;
TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
- commitTsShared->dataLastCommit.nodeid = InvalidCommitTsNodeId;
+ commitTsShared->dataLastCommit.nodeid = InvalidRepOriginId;
}
else
Assert(found);
@@ -857,7 +832,7 @@ WriteTruncateXlogRec(int pageno)
static void
WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid)
+ RepOriginId nodeid)
{
xl_commit_ts_set record;
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index acd825f..7c4d773 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -23,6 +23,7 @@
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
#include "commands/tablespace.h"
+#include "replication/origin.h"
#include "storage/standby.h"
#include "utils/relmapper.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1495bb4..34fc0ec 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -40,8 +40,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/logical.h"
#include "replication/walsender.h"
#include "replication/syncrep.h"
+#include "replication/origin.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -1073,21 +1075,23 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
InvalidTransactionId /* plain commit */);
- }
- /*
- * We only need to log the commit timestamp separately if the node
- * identifier is a valid value; the commit record above already contains
- * the timestamp info otherwise, and will be used to load it.
- */
- if (markXidCommitted)
- {
- CommitTsNodeId node_id;
+ /* record plain commit ts if not replaying remote actions */
+ if (replident_sesssion_origin == InvalidRepOriginId ||
+ replident_sesssion_origin == DoNotReplicateId ||
+ replident_sesssion_origin_timestamp == 0)
+ replident_sesssion_origin_timestamp = xactStopTimestamp;
+ else
+ replorigin_session_advance(replident_sesssion_origin_lsn,
+ XactLastRecEnd);
- node_id = CommitTsGetDefaultNodeId();
+ /*
+ * We don't need to WAL log here, the commit record contains all the
+ * necessary information and will redo the SET action during replay.
+ */
TransactionTreeSetCommitTsData(xid, nchildren, children,
- xactStopTimestamp,
- node_id, node_id != InvalidCommitTsNodeId);
+ replident_sesssion_origin_timestamp,
+ replident_sesssion_origin, false);
}
/*
@@ -1176,9 +1180,11 @@ RecordTransactionCommit(void)
if (wrote_xlog && markXidCommitted)
SyncRepWaitForLSN(XactLastRecEnd);
+ /* remember end of last commit record */
+ XactLastCommitEnd = XactLastRecEnd;
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
-
cleanup:
/* Clean up local data */
if (rels)
@@ -4611,6 +4617,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_invals xl_invals;
xl_xact_twophase xl_twophase;
+ xl_xact_origin xl_origin;
uint8 info;
@@ -4668,6 +4675,15 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_twophase.xid = twophase_xid;
}
+ /* dump transaction origin information */
+ if (replident_sesssion_origin != InvalidRepOriginId)
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replident_sesssion_origin_lsn;
+ xl_origin.origin_timestamp = replident_sesssion_origin_timestamp;
+ }
+
if (xl_xinfo.xinfo != 0)
info |= XLOG_XACT_HAS_INFO;
@@ -4709,6 +4725,12 @@ XactLogCommitRecord(TimestampTz commit_time,
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ /* we allow filtering by xacts */
+ XLogIncludeOrigin();
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -4806,10 +4828,12 @@ XactLogAbortRecord(TimestampTz abort_time,
static void
xact_redo_commit(xl_xact_parsed_commit *parsed,
TransactionId xid,
- XLogRecPtr lsn)
+ XLogRecPtr lsn,
+ RepOriginId origin_id)
{
TransactionId max_xid;
int i;
+ TimestampTz commit_time;
max_xid = TransactionIdLatest(xid, parsed->nsubxacts, parsed->subxacts);
@@ -4829,9 +4853,16 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
LWLockRelease(XidGenLock);
}
+ Assert(!!(parsed->xinfo & XACT_XINFO_HAS_ORIGIN) == (origin_id != InvalidRepOriginId));
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ commit_time = parsed->origin_timestamp;
+ else
+ commit_time = parsed->xact_time;
+
/* Set the transaction commit timestamp and metadata */
TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
- parsed->xact_time, InvalidCommitTsNodeId,
+ commit_time, origin_id,
false);
if (standbyState == STANDBY_DISABLED)
@@ -4892,6 +4923,13 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
StandbyReleaseLockTree(xid, 0, NULL);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+ false /* backward */, false /* WAL */);
+ }
+
/* Make sure files supposed to be dropped are dropped */
if (parsed->nrels > 0)
{
@@ -5047,13 +5085,13 @@ xact_redo(XLogReaderState *record)
{
Assert(!TransactionIdIsValid(parsed.twophase_xid));
xact_redo_commit(&parsed, XLogRecGetXid(record),
- record->EndRecPtr);
+ record->EndRecPtr, XLogRecGetOrigin(record));
}
else
{
Assert(TransactionIdIsValid(parsed.twophase_xid));
xact_redo_commit(&parsed, parsed.twophase_xid,
- record->EndRecPtr);
+ record->EndRecPtr, XLogRecGetOrigin(record));
RemoveTwoPhaseFile(parsed.twophase_xid, false);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..da7b6c2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "postmaster/startup.h"
#include "replication/logical.h"
#include "replication/slot.h"
+#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
@@ -295,6 +296,7 @@ static TimeLineID curFileTLI;
static XLogRecPtr ProcLastRecPtr = InvalidXLogRecPtr;
XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
+XLogRecPtr XactLastCommitEnd = InvalidXLogRecPtr;
/*
* RedoRecPtr is this backend's local copy of the REDO record pointer
@@ -6212,6 +6214,11 @@ StartupXLOG(void)
StartupMultiXact();
/*
+ * Recover knowledge about replay progress of known replication partners.
+ */
+ StartupReplicationOrigin();
+
+ /*
* Initialize unlogged LSN. On a clean shutdown, it's restored from the
* control file. On recovery, all unlogged relations are blown away, so
* the unlogged LSN counter can be reset too.
@@ -8394,6 +8401,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointBuffers(flags); /* performs all required fsyncs */
+ CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 618f879..72abd0b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -26,6 +26,7 @@
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "miscadmin.h"
+#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
#include "utils/memutils.h"
@@ -72,6 +73,9 @@ static XLogRecData *mainrdata_head;
static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
static uint32 mainrdata_len; /* total # of bytes in chain */
+/* Should te in-progress insertion log the origin */
+static bool include_origin = false;
+
/*
* These are used to hold the record header while constructing a record.
* 'hdr_scratch' is not a plain variable, but is palloc'd at initialization,
@@ -83,10 +87,12 @@ static uint32 mainrdata_len; /* total # of bytes in chain */
static XLogRecData hdr_rdt;
static char *hdr_scratch = NULL;
+#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+
#define HEADER_SCRATCH_SIZE \
(SizeOfXLogRecord + \
MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
- SizeOfXLogRecordDataHeaderLong)
+ SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
/*
* An array of XLogRecData structs, to hold registered data.
@@ -193,6 +199,7 @@ XLogResetInsertion(void)
max_registered_block_id = 0;
mainrdata_len = 0;
mainrdata_last = (XLogRecData *) &mainrdata_head;
+ include_origin = false;
begininsert_called = false;
}
@@ -375,6 +382,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
}
/*
+ * Should this record include the replication origin if one is set up?
+ */
+void
+XLogIncludeOrigin(void)
+{
+ Assert(begininsert_called);
+ include_origin = true;
+}
+
+/*
* Insert an XLOG record having the specified RMID and info bytes, with the
* body of the record being the data and buffer references registered earlier
* with XLogRegister* calls.
@@ -678,6 +695,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
scratch += sizeof(BlockNumber);
}
+ /* followed by the record's origin, if any */
+ if (include_origin && replident_sesssion_origin != InvalidRepOriginId)
+ {
+ *(scratch++) = XLR_BLOCK_ID_ORIGIN;
+ memcpy(scratch, &replident_sesssion_origin, sizeof(replident_sesssion_origin));
+ scratch += sizeof(replident_sesssion_origin);
+ }
+
/* followed by main data, if any */
if (mainrdata_len > 0)
{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 77be1b8..3661e72 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -21,6 +21,7 @@
#include "access/xlogreader.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
+#include "replication/origin.h"
static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -975,6 +976,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
ResetDecoder(state);
state->decoded_record = record;
+ state->record_origin = InvalidRepOriginId;
ptr = (char *) record;
ptr += SizeOfXLogRecord;
@@ -1009,6 +1011,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
break; /* by convention, the main data fragment is
* always last */
}
+ else if (block_id == XLR_BLOCK_ID_ORIGIN)
+ {
+ COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+ }
else if (block_id <= XLR_MAX_BLOCK_ID)
{
/* XLogRecordBlockHeader */
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index a403c64..e1af7ca 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -39,7 +39,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
pg_ts_config.h pg_ts_config_map.h pg_ts_dict.h \
pg_ts_parser.h pg_ts_template.h pg_extension.h \
pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
- pg_foreign_table.h pg_policy.h \
+ pg_foreign_table.h pg_policy.h pg_replication_origin.h \
pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \
toasting.h indexing.h \
)
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index e9d3cdc..fa2aa27 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -32,6 +32,7 @@
#include "catalog/pg_namespace.h"
#include "catalog/pg_pltemplate.h"
#include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_replication_origin.h"
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
@@ -224,7 +225,8 @@ IsSharedRelation(Oid relationId)
relationId == SharedDependRelationId ||
relationId == SharedSecLabelRelationId ||
relationId == TableSpaceRelationId ||
- relationId == DbRoleSettingRelationId)
+ relationId == DbRoleSettingRelationId ||
+ relationId == ReplicationOriginRelationId)
return true;
/* These are their indexes (see indexing.h) */
if (relationId == AuthIdRolnameIndexId ||
@@ -240,7 +242,9 @@ IsSharedRelation(Oid relationId)
relationId == SharedSecLabelObjectIndexId ||
relationId == TablespaceOidIndexId ||
relationId == TablespaceNameIndexId ||
- relationId == DbRoleSettingDatidRolidIndexId)
+ relationId == DbRoleSettingDatidRolidIndexId ||
+ relationId == ReplicationOriginIdentIndex ||
+ relationId == ReplicationOriginNameIndex)
return true;
/* These are their toast tables and toast indexes (see toasting.h) */
if (relationId == PgShdescriptionToastTable ||
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4c35ef4..2ad01f4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -778,6 +778,13 @@ CREATE VIEW pg_user_mappings AS
REVOKE ALL on pg_user_mapping FROM public;
+
+CREATE VIEW pg_replication_origin_status AS
+ SELECT *
+ FROM pg_show_replication_origin_status();
+
+REVOKE ALL ON pg_replication_origin_status FROM public;
+
--
-- We have a few function definitions in here, too.
-- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index 310a45c..8adea13 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
-OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o snapbuild.o
+OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o origin.o \
+ snapbuild.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eb7293f..8842496 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -40,6 +40,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "storage/standby.h"
@@ -131,6 +132,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
case RM_SPGIST_ID:
case RM_BRIN_ID:
case RM_COMMIT_TS_ID:
+ case RM_REPLORIGIN_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
@@ -422,6 +424,15 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
}
+static inline bool
+FilterByOrigin(LogicalDecodingContext *ctx, RepOriginId origin_id)
+{
+ if (ctx->callbacks.filter_by_origin_cb == NULL)
+ return false;
+
+ return filter_by_origin_cb_wrapper(ctx, origin_id);
+}
+
/*
* Consolidated commit record handling between the different form of commit
* records.
@@ -430,8 +441,17 @@ static void
DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid)
{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ XLogRecPtr commit_time = InvalidXLogRecPtr;
+ XLogRecPtr origin_id = InvalidRepOriginId;
int i;
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
/*
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
@@ -452,12 +472,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* the reorderbuffer to forget the content of the (sub-)transactions
* if not.
*
- * There basically two reasons we might not be interested in this
+ * There can be several reasons we might not be interested in this
* transaction:
* 1) We might not be interested in decoding transactions up to this
* LSN. This can happen because we previously decoded it and now just
* are restarting or if we haven't assembled a consistent snapshot yet.
* 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -472,7 +493,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* ---
*/
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database))
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
{
for (i = 0; i < parsed->nsubxacts; i++)
{
@@ -492,7 +514,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
/* replay actions of all transaction + subtransactions in order */
ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- parsed->xact_time);
+ commit_time, origin_id, origin_lsn);
}
/*
@@ -537,8 +559,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -579,8 +606,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_UPDATE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -628,8 +660,13 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (target_node.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_DELETE;
+ change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
@@ -673,6 +710,10 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (rnode.dbNode != ctx->slot->data.database)
return;
+ /* output plugin doesn't look for this origin, no need to queue */
+ if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+ return;
+
tupledata = XLogRecGetBlockData(r, 0, &tuplelen);
data = tupledata;
@@ -685,6 +726,8 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
change = ReorderBufferGetChange(ctx->reorder);
change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->origin_id = XLogRecGetOrigin(r);
+
memcpy(&change->data.tp.relnode, &rnode, sizeof(RelFileNode));
/*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 774ebbc..45d1436 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -39,6 +39,7 @@
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
+#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "storage/proc.h"
@@ -720,6 +721,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+bool
+filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
+{
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "shutdown";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_by_origin_cb(ctx, origin_id);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
/*
* Set the required catalog xmin horizon for historic snapshots in the current
* replication slot.
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
new file mode 100644
index 0000000..e56ff60
--- /dev/null
+++ b/src/backend/replication/logical/origin.c
@@ -0,0 +1,1479 @@
+/*-------------------------------------------------------------------------
+ *
+ * origin.c
+ * Logical replication progress tracking support.
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/origin.c
+ *
+ * NOTES
+ *
+ * This file provides the following:
+ * * An infrastructure to name nodes in a replication setup
+ * * A facility to efficiently store and persist replication progress in a
+ * efficient and durable manner.
+ *
+ * Replication origin consist out of a descriptive, user defined, external
+ * name and a short, thus space efficient, internal 2 byte one. This split
+ * exists because replication origin have to be stored in WAL and shared
+ * memory and long descriptors would be inefficient. For now only use 2 bytes
+ * for the internal id of a replication origin as it seems unlikely that there
+ * soon will be more than 65k nodes in one replication setup; and using only
+ * two bytes allow us to be more space efficient.
+ *
+ * Replication progress is tracked in a shared memory table
+ * (ReplicationStates) that's dumped to disk every checkpoint. Entries
+ * ('slots') in this table are identified by the internal id. That's the case
+ * because it allows to increase replication progress during crash
+ * recovery. To allow doing so we store the original LSN (from the originating
+ * system) of a transaction in the commit record. That allows to recover the
+ * precise replayed state after crash recovery; without requiring synchronous
+ * commits. Allowing logical replication to use asynchronous commit is
+ * generally good for performance, but especially important as it allows a
+ * single threaded replay process to keep up with a source that has multiple
+ * backends generating changes concurrently. For efficiency and simplicity
+ * reasons a backend can setup one replication origin that's from then used as
+ * the source of changes produced by the backend, until reset again.
+ *
+ * This infrastructure is intended to be used in cooperation with logical
+ * decoding. When replaying from a remote system the configured origin is
+ * provided to output plugins, allowing prevention of replication loops and
+ * other filtering.
+ *
+ * There are several levels of locking at work:
+ *
+ * * To create and drop replication origins a exclusive lock on
+ * pg_replication_slot is required for the duration. That allows us to
+ * safely and conflict free assign new origins using a dirty snapshot.
+ *
+ * * When creating a in-memory replication progress slot the ReplicationOirgin
+ * LWLock has to be held exclusively; when iterating over the replication
+ * progress a shared lock has to be held, the same when advancing the
+ * replication progress of a individual backend that has not setup as the
+ * session's replication origin.
+ *
+ * * When manipulating or looking at the remote_lsn and local_lsn fields of a
+ * replication progress slot that slot's lwlock has to be held. That's
+ * primarily because we do not assume 8 byte writes (the LSN) is atomic on
+ * all our platforms, but it also simplifies memory ordering concerns
+ * between the remote and local lsn. We use a lwlock instead of a spinlock
+ * so it's less harmful to hold the lock over a WAL write
+ * (c.f. AdvanceReplicationProgress).
+ *
+ * ---------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <unistd.h>
+#include <sys/stat.h>
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+
+#include "catalog/indexing.h"
+
+#include "nodes/execnodes.h"
+
+#include "replication/origin.h"
+#include "replication/logical.h"
+
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/copydir.h"
+
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/pg_lsn.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+/*
+ * Replay progress of a single remote node.
+ */
+typedef struct ReplicationState
+{
+ /*
+ * Local identifier for the remote node.
+ */
+ RepOriginId roident;
+
+ /*
+ * Location of the latest commit from the remote side.
+ */
+ XLogRecPtr remote_lsn;
+
+ /*
+ * Remember the local lsn of the commit record so we can XLogFlush() to it
+ * during a checkpoint so we know the commit record actually is safe on
+ * disk.
+ */
+ XLogRecPtr local_lsn;
+
+ /*
+ * Slot is setup in backend?
+ */
+ pid_t acquired_by;
+
+ /*
+ * Lock protecting remote_lsn and local_lsn.
+ */
+ LWLock lock;
+} ReplicationState;
+
+/*
+ * On disk version of ReplicationState.
+ */
+typedef struct ReplicationStateOnDisk
+{
+ RepOriginId roident;
+ XLogRecPtr remote_lsn;
+} ReplicationStateOnDisk;
+
+
+typedef struct ReplicationStateCtl
+{
+ int tranche_id;
+ LWLockTranche tranche;
+ ReplicationState states[FLEXIBLE_ARRAY_MEMBER];
+} ReplicationStateCtl;
+
+/* external variables */
+RepOriginId replident_sesssion_origin = InvalidRepOriginId; /* assumed identity */
+XLogRecPtr replident_sesssion_origin_lsn = InvalidXLogRecPtr;
+TimestampTz replident_sesssion_origin_timestamp = 0;
+
+/*
+ * Base address into a shared memory array of replication states of size
+ * max_replication_slots.
+ *
+ * XXX: Should we use a separate variable to size this rather than
+ * max_replication_slots?
+ */
+static ReplicationState *replication_states;
+static ReplicationStateCtl *replication_states_ctl;
+
+/*
+ * Backend-local, cached element from ReplicationStates for use in a backend
+ * replaying remote commits, so we don't have to search ReplicationStates for
+ * the backends current RepOriginId.
+ */
+static ReplicationState *session_replication_state = NULL;
+
+/* Magic for on disk files. */
+#define REPLICATION_STATE_MAGIC ((uint32) 0x1257DADE)
+
+static void
+replorigin_check_prerequisites(bool check_slots)
+{
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("only superusers can query or manipulate replication origins")));
+
+ if (check_slots && max_replication_slots == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot query or manipulate replication origin when max_replication_slots = 0")));
+}
+
+
+/* ---------------------------------------------------------------------------
+ * Functions for working with replication origins themselves.
+ * ---------------------------------------------------------------------------
+ */
+
+/*
+ * Check for a persistent replication origin identified by name.
+ *
+ * Returns InvalidOid if the node isn't known yet and missing_ok is true.
+ */
+RepOriginId
+replorigin_by_name(char *roname, bool missing_ok)
+{
+ Form_pg_replication_origin ident;
+ Oid roident = InvalidOid;
+ HeapTuple tuple;
+ Datum roname_d;
+
+ roname_d = CStringGetTextDatum(roname);
+
+ tuple = SearchSysCache1(REPLORIGNAME, roname_d);
+ if (HeapTupleIsValid(tuple))
+ {
+ ident = (Form_pg_replication_origin) GETSTRUCT(tuple);
+ roident = ident->roident;
+ ReleaseSysCache(tuple);
+ }
+ else if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication origin '%s'",
+ roname);
+
+ return roident;
+}
+
+/*
+ * Create a replication origin.
+ *
+ * Needs to be called in a transaction.
+ */
+RepOriginId
+replorigin_create(char *roname)
+{
+ Oid roident;
+ HeapTuple tuple = NULL;
+ Relation rel;
+ Datum roname_d;
+ SnapshotData SnapshotDirty;
+ SysScanDesc scan;
+ ScanKeyData key;
+
+ roname_d = CStringGetTextDatum(roname);
+
+ Assert(IsTransactionState());
+
+ /*
+ * We need the numeric replication origin to be 16bit wide, so we cannot
+ * rely on the normal oid allocation. Instead we simply scan
+ * pg_replication_origin for the first unused id. That's not particularly
+ * efficient, but this should be an fairly infrequent operation - we can
+ * easily spend a bit more code on this when it turns out it needs to be
+ * faster.
+ *
+ * We handle concurrency by taking an exclusive lock (allowing reads!)
+ * over the table for the duration of the search. Because we use a "dirty
+ * snapshot" we can read rows that other in-progress sessions have
+ * written, even though they would be invisible with normal snapshots. Due
+ * to the exclusive lock there's no danger that new rows can appear while
+ * we're checking.
+ */
+ InitDirtySnapshot(SnapshotDirty);
+
+ rel = heap_open(ReplicationOriginRelationId, ExclusiveLock);
+
+ for (roident = InvalidOid + 1; roident < UINT16_MAX; roident++)
+ {
+ bool nulls[Natts_pg_replication_origin];
+ Datum values[Natts_pg_replication_origin];
+ bool collides;
+ CHECK_FOR_INTERRUPTS();
+
+ ScanKeyInit(&key,
+ Anum_pg_replication_origin_roident,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(roident));
+
+ scan = systable_beginscan(rel, ReplicationOriginIdentIndex,
+ true /* indexOK */,
+ &SnapshotDirty,
+ 1, &key);
+
+ collides = HeapTupleIsValid(systable_getnext(scan));
+
+ systable_endscan(scan);
+
+ if (!collides)
+ {
+ /*
+ * Ok, found an unused roident, insert the new row and do a CCI,
+ * so our callers can look it up if they want to.
+ */
+ memset(&nulls, 0, sizeof(nulls));
+
+ values[Anum_pg_replication_origin_roident -1] = ObjectIdGetDatum(roident);
+ values[Anum_pg_replication_origin_roname - 1] = roname_d;
+
+ tuple = heap_form_tuple(RelationGetDescr(rel), values, nulls);
+ simple_heap_insert(rel, tuple);
+ CatalogUpdateIndexes(rel, tuple);
+ CommandCounterIncrement();
+ break;
+ }
+ }
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+
+ if (tuple == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("no free replication oid could be found")));
+
+ heap_freetuple(tuple);
+ return roident;
+}
+
+
+/*
+ * Drop replication origin.
+ *
+ * Needs to be called in a transaction.
+ */
+void
+replorigin_drop(RepOriginId roident)
+{
+ HeapTuple tuple = NULL;
+ Relation rel;
+ int i;
+
+ Assert(IsTransactionState());
+
+ rel = heap_open(ReplicationOriginRelationId, ExclusiveLock);
+
+ /* cleanup the slot state info */
+ LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state = &replication_states[i];
+
+ /* found our slot */
+ if (state->roident == roident)
+ {
+ if (state->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("cannot drop replication origin with oid %d, in use by pid %d",
+ state->roident,
+ state->acquired_by)));
+ }
+
+ /* first WAL log */
+ {
+ xl_replorigin_drop xlrec;
+
+ xlrec.node_id = roident;
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+ XLogInsert(RM_REPLORIGIN_ID, XLOG_REPLORIGIN_DROP);
+ }
+
+ /* then reset the in-memory entry */
+ state->roident = InvalidRepOriginId;
+ state->remote_lsn = InvalidXLogRecPtr;
+ state->local_lsn = InvalidXLogRecPtr;
+ break;
+ }
+ }
+ LWLockRelease(ReplicationOriginLock);
+
+ tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
+ simple_heap_delete(rel, &tuple->t_self);
+ ReleaseSysCache(tuple);
+
+ CommandCounterIncrement();
+
+ /* now release lock again, */
+ heap_close(rel, ExclusiveLock);
+}
+
+
+/*
+ * Lookup replication origin via it's oid and return the name.
+ *
+ * The external name is palloc'd in the calling context.
+ *
+ * Returns true if the origin is known, false otherwise.
+ */
+bool
+replorigin_by_oid(RepOriginId roident, bool missing_ok, char **roname)
+{
+ HeapTuple tuple;
+ Form_pg_replication_origin ric;
+
+ Assert(OidIsValid((Oid) roident));
+ Assert(roident != InvalidRepOriginId);
+ Assert(roident != DoNotReplicateId);
+
+ tuple = SearchSysCache1(REPLORIGIDENT,
+ ObjectIdGetDatum((Oid) roident));
+
+ if (HeapTupleIsValid(tuple))
+ {
+ ric = (Form_pg_replication_origin) GETSTRUCT(tuple);
+ *roname = text_to_cstring(&ric->roname);
+ ReleaseSysCache(tuple);
+
+ return true;
+ }
+ else
+ {
+ *roname = NULL;
+
+ if (!missing_ok)
+ elog(ERROR, "cache lookup failed for replication origin with oid %u",
+ roident);
+
+ return false;
+ }
+}
+
+
+/* ---------------------------------------------------------------------------
+ * Functions for handling replication progress.
+ * ---------------------------------------------------------------------------
+ */
+
+Size
+ReplicationOriginShmemSize(void)
+{
+ Size size = 0;
+
+ /*
+ * XXX: max_replication_slots is arguablethe wrong thing to use here, here
+ * we keep the replay state of *remote* transactions. But for now it seems
+ * sufficient to reuse it, lest we introduce a separate guc.
+ */
+ if (max_replication_slots == 0)
+ return size;
+
+ size = add_size(size, offsetof(ReplicationStateCtl, states));
+
+ size = add_size(size,
+ mul_size(max_replication_slots, sizeof(ReplicationState)));
+ return size;
+}
+
+void
+ReplicationOriginShmemInit(void)
+{
+ bool found;
+
+ if (max_replication_slots == 0)
+ return;
+
+ replication_states_ctl = (ReplicationStateCtl *)
+ ShmemInitStruct("ReplicationOriginState",
+ ReplicationOriginShmemSize(),
+ &found);
+ replication_states = replication_states_ctl->states;
+
+ if (!found)
+ {
+ int i;
+
+ replication_states_ctl->tranche_id = LWLockNewTrancheId();
+ replication_states_ctl->tranche.name = "ReplicationOrigins";
+ replication_states_ctl->tranche.array_base =
+ &replication_states[0].lock;
+ replication_states_ctl->tranche.array_stride =
+ sizeof(ReplicationState);
+
+ MemSet(replication_states, 0, ReplicationOriginShmemSize());
+
+ for (i = 0; i < max_replication_slots; i++)
+ LWLockInitialize(&replication_states[i].lock,
+ replication_states_ctl->tranche_id);
+ }
+
+ LWLockRegisterTranche(replication_states_ctl->tranche_id,
+ &replication_states_ctl->tranche);
+}
+
+/* ---------------------------------------------------------------------------
+ * Perform a checkpoint of each replication origin's progress with respect to
+ * the replayed remote_lsn. Make sure that all transactions we refer to in the
+ * checkpoint (local_lsn) are actually on-disk. This might not yet be the case
+ * if the transactions were originally committed asynchronously.
+ *
+ * We store checkpoints in the following format:
+ * +-------+------------------------+------------------+-----+--------+
+ * | MAGIC | ReplicationStateOnDisk | struct Replic... | ... | CRC32C | EOF
+ * +-------+------------------------+------------------+-----+--------+
+ *
+ * So its just the magic, followed by the statically sized
+ * ReplicationStateOnDisk structs. Note that the maximum number of
+ * ReplicationStates is determined by max_replication_slots.
+ * ---------------------------------------------------------------------------
+ */
+void
+CheckPointReplicationOrigin(void)
+{
+ const char *tmppath = "pg_logical/replident_checkpoint.tmp";
+ const char *path = "pg_logical/replident_checkpoint";
+ int tmpfd;
+ int i;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ pg_crc32c crc;
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ /* make sure no old temp file is remaining */
+ if (unlink(tmppath) < 0 && errno != ENOENT)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ path)));
+
+ /*
+ * no other backend can perform this at the same time, we're protected by
+ * CheckpointLock.
+ */
+ tmpfd = OpenTransientFile((char *) tmppath,
+ O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
+ S_IRUSR | S_IWUSR);
+ if (tmpfd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m",
+ tmppath)));
+
+ /* write magic */
+ if ((write(tmpfd, &magic, sizeof(magic))) != sizeof(magic))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ /* prevent concurrent creations/drops */
+ LWLockAcquire(ReplicationOriginLock, LW_SHARED);
+
+ /* write actual data */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationStateOnDisk disk_state;
+ ReplicationState *curstate = &replication_states[i];
+ XLogRecPtr local_lsn;
+
+ if (curstate->roident == InvalidRepOriginId)
+ continue;
+
+ LWLockAcquire(&curstate->lock, LW_SHARED);
+
+ disk_state.roident = curstate->roident;
+
+ disk_state.remote_lsn = curstate->remote_lsn;
+ local_lsn = curstate->local_lsn;
+
+ LWLockRelease(&curstate->lock);
+
+ /* make sure we only write out a commit that's persistent */
+ XLogFlush(local_lsn);
+
+ if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
+ sizeof(disk_state))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+ }
+
+ LWLockRelease(ReplicationOriginLock);
+
+ /* write out the CRC */
+ FIN_CRC32C(crc);
+ if ((write(tmpfd, &crc, sizeof(crc))) != sizeof(crc))
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m",
+ tmppath)));
+ }
+
+ /* fsync the temporary file */
+ if (pg_fsync(tmpfd) != 0)
+ {
+ CloseTransientFile(tmpfd);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ tmppath)));
+ }
+
+ CloseTransientFile(tmpfd);
+
+ /* rename to permanent file, fsync file and directory */
+ if (rename(tmppath, path) != 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+ tmppath, path)));
+ }
+
+ fsync_fname((char *) path, false);
+ fsync_fname("pg_logical", true);
+}
+
+/*
+ * Recover replication replay status from checkpoint data saved earlier by
+ * CheckPointReplicationOrigin.
+ *
+ * This only needs to be called at startup and *not* during every checkpoint
+ * read during recovery (e.g. in HS or PITR from a base backup) afterwards. All
+ * state thereafter can be recovered by looking at commit records.
+ */
+void
+StartupReplicationOrigin(void)
+{
+ const char *path = "pg_logical/replident_checkpoint";
+ int fd;
+ int readBytes;
+ uint32 magic = REPLICATION_STATE_MAGIC;
+ int last_state = 0;
+ pg_crc32c file_crc;
+ pg_crc32c crc;
+
+ /* don't want to overwrite already existing state */
+#ifdef USE_ASSERT_CHECKING
+ static bool already_started = false;
+ Assert(!already_started);
+ already_started = true;
+#endif
+
+ if (max_replication_slots == 0)
+ return;
+
+ INIT_CRC32C(crc);
+
+ elog(DEBUG2, "starting up replication origin progress state");
+
+ fd = OpenTransientFile((char *) path, O_RDONLY | PG_BINARY, 0);
+
+ /*
+ * might have had max_replication_slots == 0 last run, or we just brought up a
+ * standby.
+ */
+ if (fd < 0 && errno == ENOENT)
+ return;
+ else if (fd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+
+ /* verify magic, thats written even if nothing was active */
+ readBytes = read(fd, &magic, sizeof(magic));
+ if (readBytes != sizeof(magic))
+ ereport(PANIC,
+ (errmsg("could not read file \"%s\": %m",
+ path)));
+ COMP_CRC32C(crc, &magic, sizeof(magic));
+
+ if (magic != REPLICATION_STATE_MAGIC)
+ ereport(PANIC,
+ (errmsg("replication checkpoint has wrong magic %u instead of %u",
+ magic, REPLICATION_STATE_MAGIC)));
+
+ /* we can skip locking here, no other access is possible */
+
+ /* recover individual states, until there are no more to be found */
+ while (true)
+ {
+ ReplicationStateOnDisk disk_state;
+
+ readBytes = read(fd, &disk_state, sizeof(disk_state));
+
+ /* no further data */
+ if (readBytes == sizeof(crc))
+ {
+ /* not pretty, but simple ... */
+ file_crc = *(pg_crc32c*) &disk_state;
+ break;
+ }
+
+ if (readBytes < 0)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ path)));
+ }
+
+ if (readBytes != sizeof(disk_state))
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": read %d of %zu",
+ path, readBytes, sizeof(disk_state))));
+ }
+
+ COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+
+ if (last_state == max_replication_slots)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state could be found, increase max_replication_slots")));
+
+ /* copy data to shared memory */
+ replication_states[last_state].roident = disk_state.roident;
+ replication_states[last_state].remote_lsn = disk_state.remote_lsn;
+ last_state++;
+
+ elog(LOG, "recovered replication state of node %u to %X/%X",
+ disk_state.roident,
+ (uint32)(disk_state.remote_lsn >> 32),
+ (uint32)disk_state.remote_lsn);
+ }
+
+ /* now check checksum */
+ FIN_CRC32C(crc);
+ if (file_crc != crc)
+ ereport(PANIC,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("replication_slot_checkpoint has wrong checksum %u, expected %u",
+ crc, file_crc)));
+
+ CloseTransientFile(fd);
+}
+
+void
+replorigin_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case XLOG_REPLORIGIN_SET:
+ {
+ xl_replorigin_set *xlrec =
+ (xl_replorigin_set *) XLogRecGetData(record);
+
+ replorigin_advance(xlrec->node_id,
+ xlrec->remote_lsn, record->EndRecPtr,
+ xlrec->force /* backward */,
+ false /* WAL log */);
+ break;
+ }
+ case XLOG_REPLORIGIN_DROP:
+ {
+ xl_replorigin_drop *xlrec;
+ int i;
+
+ xlrec = (xl_replorigin_drop *) XLogRecGetData(record);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state = &replication_states[i];
+
+ /* found our slot */
+ if (state->roident == xlrec->node_id)
+ {
+ /* reset entry */
+ state->roident = InvalidRepOriginId;
+ state->remote_lsn = InvalidXLogRecPtr;
+ state->local_lsn = InvalidXLogRecPtr;
+ break;
+ }
+ }
+ break;
+ }
+ default:
+ elog(PANIC, "replident_redo: unknown op code %u", info);
+ }
+}
+
+
+/*
+ * Tell the replication origin progress machinery that a commit from 'node'
+ * that originated at the LSN remote_commit on the remote node was replayed
+ * successfully and that we don't need to do so again. In combination with
+ * setting up replident_sesssion_origin_lsn and replident_sesssion_origin that ensures we
+ * won't loose knowledge about that after a crash if the the transaction had a
+ * persistent effect (think of asynchronous commits).
+ *
+ * local_commit needs to be a local LSN of the commit so that we can make sure
+ * uppon a checkpoint that enough WAL has been persisted to disk.
+ *
+ * Needs to be called with a RowExclusiveLock on pg_replication_origin,
+ * unless running in recovery.
+ */
+void
+replorigin_advance(RepOriginId node,
+ XLogRecPtr remote_commit, XLogRecPtr local_commit,
+ bool go_backward, bool wal_log)
+{
+ int i;
+ ReplicationState *replication_state = NULL;
+ ReplicationState *free_state = NULL;
+
+ Assert(node != InvalidRepOriginId);
+
+ /* we don't track DoNotReplicateId */
+ if (node == DoNotReplicateId)
+ return;
+
+ /*
+ * XXX: For the case where this is called by WAL replay, it'd be more
+ * efficient to restore into a backend local hashtable and only dump into
+ * shmem after recovery is finished. Let's wait with implementing that
+ * till it's shown to be a measurable expense
+ */
+
+ /* Lock exclusively, as we may have to create a new table entry. */
+ LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
+
+ /*
+ * Search for either an existing slot for the origin, or a free one we can
+ * use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &replication_states[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->roident == InvalidRepOriginId &&
+ free_state == NULL)
+ {
+ free_state = curstate;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->roident != node)
+ {
+ continue;
+ }
+
+ /* ok, found slot */
+ replication_state = curstate;
+
+ LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);
+
+ /* Make sure it's not used by somebody else */
+ if (replication_state->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication origin with oid %d is already active for pid %d",
+ replication_state->roident,
+ replication_state->acquired_by)));
+ }
+
+ break;
+ }
+
+ if (replication_state == NULL && free_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state slot could be found for replication origin with oid %u",
+ node),
+ errhint("Increase max_replication_slots and try again.")));
+
+ if (replication_state == NULL)
+ {
+ /* initialize new slot */
+ LWLockAcquire(&free_state->lock, LW_EXCLUSIVE);
+ replication_state = free_state;
+ Assert(replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(replication_state->local_lsn == InvalidXLogRecPtr);
+ replication_state->roident = node;
+ }
+
+ Assert(replication_state->roident != InvalidRepOriginId);
+
+ /*
+ * If somebody "forcefully" sets this slot, WAL log it, so it's durable
+ * and the standby gets the message. Primarily this will be called during
+ * WAL replay (of commit records) where no WAL logging is necessary.
+ */
+ if (wal_log)
+ {
+ xl_replorigin_set xlrec;
+ xlrec.remote_lsn = remote_commit;
+ xlrec.node_id = node;
+ xlrec.force = go_backward;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+
+ XLogInsert(RM_REPLORIGIN_ID, XLOG_REPLORIGIN_SET);
+ }
+
+ /*
+ * Due to - harmless - race conditions during a checkpoint we could see
+ * values here that are older than the ones we already have in
+ * memory. Don't overwrite those.
+ */
+ if (go_backward || replication_state->remote_lsn < remote_commit)
+ replication_state->remote_lsn = remote_commit;
+ if (local_commit != InvalidXLogRecPtr &&
+ (go_backward || replication_state->local_lsn < local_commit))
+ replication_state->local_lsn = local_commit;
+ LWLockRelease(&replication_state->lock);
+
+ /*
+ * Release *after* changing the LSNs, slot isn't acquired and thus could
+ * otherwise be dropped anytime.
+ */
+ LWLockRelease(ReplicationOriginLock);
+}
+
+
+XLogRecPtr
+replorigin_get_progress(RepOriginId node, bool flush)
+{
+ int i;
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+
+ /* prevent slots from being concurrently dropped */
+ LWLockAcquire(ReplicationOriginLock, LW_SHARED);
+
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+
+ state = &replication_states[i];
+
+ if (state->roident == node)
+ {
+ LWLockAcquire(&state->lock, LW_SHARED);
+
+ remote_lsn = state->remote_lsn;
+ local_lsn = state->local_lsn;
+
+ LWLockRelease(&state->lock);
+
+ break;
+ }
+ }
+
+ LWLockRelease(ReplicationOriginLock);
+
+ if (flush && local_lsn != InvalidXLogRecPtr)
+ XLogFlush(local_lsn);
+
+ return remote_lsn;
+}
+
+/*
+ * Tear down a (possibly) configured session replication origin during process
+ * exit.
+ */
+static void
+ReplicationOriginExitCleanup(int code, Datum arg)
+{
+
+ LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
+
+ if (session_replication_state != NULL &&
+ session_replication_state->acquired_by == MyProcPid)
+ {
+ session_replication_state->acquired_by = 0;
+ session_replication_state = NULL;
+ }
+
+ LWLockRelease(ReplicationOriginLock);
+}
+
+/*
+ * Setup a replication origin in the shared memory struct if it doesn't
+ * already exists and cache access to the specific ReplicationSlot so the
+ * array doesn't have to be searched when calling
+ * replorigin_session_advance().
+ *
+ * Obviously only one such cached origin can exist per process and the current
+ * cached value can only be set again after the previous value is torn down
+ * with replorigin_session_reset().
+ */
+void
+replorigin_session_setup(RepOriginId node)
+{
+ static bool registered_cleanup;
+ int i;
+ int free_slot = -1;
+
+ if (!registered_cleanup)
+ {
+ on_shmem_exit(ReplicationOriginExitCleanup, 0);
+ registered_cleanup = true;
+ }
+
+ Assert(max_replication_slots > 0);
+
+ if (session_replication_state != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot setup replication origin when one is already setup")));
+
+ /* Lock exclusively, as we may have to create a new table entry. */
+ LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
+
+ /*
+ * Search for either an existing slot for the origin, or a free one we can
+ * use.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *curstate = &replication_states[i];
+
+ /* remember where to insert if necessary */
+ if (curstate->roident == InvalidRepOriginId &&
+ free_slot == -1)
+ {
+ free_slot = i;
+ continue;
+ }
+
+ /* not our slot */
+ if (curstate->roident != node)
+ continue;
+
+ else if (curstate->acquired_by != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer %d is already active for pid %d",
+ curstate->roident, curstate->acquired_by)));
+ }
+
+ /* ok, found slot */
+ session_replication_state = curstate;
+ }
+
+
+ if (session_replication_state == NULL && free_slot == -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("no free replication state slot could be found for replication origin with oid %u",
+ node),
+ errhint("Increase max_replication_slots and try again.")));
+ else if (session_replication_state == NULL)
+ {
+ /* initialize new slot */
+ session_replication_state = &replication_states[free_slot];
+ Assert(session_replication_state->remote_lsn == InvalidXLogRecPtr);
+ Assert(session_replication_state->local_lsn == InvalidXLogRecPtr);
+ session_replication_state->roident = node;
+ }
+
+
+ Assert(session_replication_state->roident != InvalidRepOriginId);
+
+ session_replication_state->acquired_by = MyProcPid;
+
+ LWLockRelease(ReplicationOriginLock);
+}
+
+/*
+ * Reset replay state previously setup in this session.
+ *
+ * This function may only be called if a origin was setup with
+ * replident_session_setup().
+ */
+void
+replorigin_session_reset(void)
+{
+ Assert(max_replication_slots != 0);
+
+ if (session_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("no replication origin is configured")));
+
+ LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);
+
+ session_replication_state->acquired_by = 0;
+ session_replication_state = NULL;
+
+ LWLockRelease(ReplicationOriginLock);
+}
+
+/*
+ * Do the same work replorigin_advance() does, just on the session's
+ * configured origin.
+ *
+ * This is noticeably cheaper than using replorigin_advance().
+ */
+void
+replorigin_session_advance(XLogRecPtr remote_commit, XLogRecPtr local_commit)
+{
+ Assert(session_replication_state != NULL);
+ Assert(session_replication_state->roident != InvalidRepOriginId);
+
+ LWLockAcquire(&session_replication_state->lock, LW_EXCLUSIVE);
+ if (session_replication_state->local_lsn < local_commit)
+ session_replication_state->local_lsn = local_commit;
+ if (session_replication_state->remote_lsn < remote_commit)
+ session_replication_state->remote_lsn = remote_commit;
+ LWLockRelease(&session_replication_state->lock);
+}
+
+/*
+ * Ask the machinery about the point up to which we successfully replayed
+ * changes from a already setup replication origin.
+ */
+XLogRecPtr
+replorigin_session_get_progress(bool flush)
+{
+ XLogRecPtr remote_lsn;
+ XLogRecPtr local_lsn;
+
+ Assert(session_replication_state != NULL);
+
+ LWLockAcquire(&session_replication_state->lock, LW_SHARED);
+ remote_lsn = session_replication_state->remote_lsn;
+ local_lsn = session_replication_state->local_lsn;
+ LWLockRelease(&session_replication_state->lock);
+
+ if (flush && local_lsn != InvalidXLogRecPtr)
+ XLogFlush(local_lsn);
+
+ return remote_lsn;
+}
+
+
+
+/* ---------------------------------------------------------------------------
+* SQL functions for working with replication origin.
+ *
+ * These mostly should be fairly short wrappers around more generic functions.
+ * ---------------------------------------------------------------------------
+ */
+
+/*
+ * Create replication origin for the passed in name, and return the assigned
+ * oid.
+ */
+Datum
+pg_replication_origin_create(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepOriginId roident;
+
+ replorigin_check_prerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ roident = replorigin_create(name);
+
+ pfree(name);
+
+ PG_RETURN_OID(roident);
+}
+
+/*
+ * Drop replication origin.
+ */
+Datum
+pg_replication_origin_drop(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepOriginId roident;
+
+ replorigin_check_prerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+
+ roident = replorigin_by_name(name, false);
+ Assert(OidIsValid(roident));
+
+ replorigin_drop(roident);
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Return oid of a replication origin.
+ */
+Datum
+pg_replication_origin_oid(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepOriginId roident;
+
+ replorigin_check_prerequisites(false);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ roident = replorigin_by_name(name, true);
+
+ pfree(name);
+
+ if (OidIsValid(roident))
+ PG_RETURN_OID(roident);
+ PG_RETURN_NULL();
+}
+
+/*
+ * Setup a replication origin for this session.
+ */
+Datum
+pg_replication_origin_session_setup(PG_FUNCTION_ARGS)
+{
+ char *name;
+ RepOriginId origin;
+
+ replorigin_check_prerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ origin = replorigin_by_name(name, false);
+ replorigin_session_setup(origin);
+
+ replident_sesssion_origin = origin;
+
+ pfree(name);
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Reset previously setup origin in this session
+ */
+Datum
+pg_replication_origin_session_reset(PG_FUNCTION_ARGS)
+{
+ replorigin_check_prerequisites(true);
+
+ replorigin_session_reset();
+
+ /* FIXME */
+ replident_sesssion_origin = InvalidRepOriginId;
+ replident_sesssion_origin_lsn = InvalidXLogRecPtr;
+ replident_sesssion_origin_timestamp = 0;
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Has a replication origin been setup for this session.
+ */
+Datum
+pg_replication_origin_session_is_setup(PG_FUNCTION_ARGS)
+{
+ replorigin_check_prerequisites(false);
+
+ PG_RETURN_BOOL(replident_sesssion_origin != InvalidRepOriginId);
+}
+
+
+/*
+ * Return the replication progress for origin setup in the current session.
+ *
+ * If 'flush' is set to true it is ensured that the returned value corresponds
+ * to a local transaction that has been flushed. this is useful if asychronous
+ * commits are used when replaying replicated transactions.
+ */
+Datum
+pg_replication_origin_session_progress(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+ bool flush = PG_GETARG_BOOL(0);
+
+ replorigin_check_prerequisites(true);
+
+ if (session_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("no replication origin is configured")));
+
+ remote_lsn = replorigin_session_get_progress(flush);
+
+ if (remote_lsn == InvalidXLogRecPtr)
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(remote_lsn);
+}
+
+Datum
+pg_replication_origin_xact_setup(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr location = PG_GETARG_LSN(0);
+
+ replorigin_check_prerequisites(true);
+
+ if (session_replication_state == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("no replication origin is configured")));
+
+ replident_sesssion_origin_lsn = location;
+ replident_sesssion_origin_timestamp = PG_GETARG_TIMESTAMPTZ(1);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+pg_replication_origin_xact_reset(PG_FUNCTION_ARGS)
+{
+ replorigin_check_prerequisites(true);
+
+ replident_sesssion_origin_lsn = InvalidXLogRecPtr;
+ replident_sesssion_origin_timestamp = 0;
+
+ PG_RETURN_VOID();
+}
+
+
+Datum
+pg_replication_origin_advance(PG_FUNCTION_ARGS)
+{
+ text *name = PG_GETARG_TEXT_P(0);
+ XLogRecPtr remote_commit = PG_GETARG_LSN(1);
+ RepOriginId node;
+
+ replorigin_check_prerequisites(true);
+
+ /* lock to prevent the replication origin from vanishing */
+ LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
+ node = replorigin_by_name(text_to_cstring(name), false);
+
+ /*
+ * Can't sensibly pass a local commit to be flushed at checkpoint - this
+ * xact hasn't committed yet. This is why this function should be used to
+ * set up the intial replication state, but not for replay.
+ */
+ replorigin_advance(node, remote_commit, InvalidXLogRecPtr,
+ true /* go backward */, true /* wal log */);
+
+ UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+
+/*
+ * Return the replication progress for an individual replication origin.
+ *
+ * If 'flush' is set to true it is ensured that the returned value corresponds
+ * to a local transaction that has been flushed. this is useful if asychronous
+ * commits are used when replaying replicated transactions.
+ */
+Datum
+pg_replication_origin_progress(PG_FUNCTION_ARGS)
+{
+ char *name;
+ bool flush;
+ RepOriginId roident;
+ XLogRecPtr remote_lsn = InvalidXLogRecPtr;
+
+ replorigin_check_prerequisites(true);
+
+ name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0)));
+ flush = PG_GETARG_BOOL(1);
+
+ roident = replorigin_by_name(name, false);
+ Assert(OidIsValid(roident));
+
+ remote_lsn = replorigin_get_progress(roident, flush);
+
+ if (remote_lsn == InvalidXLogRecPtr)
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(remote_lsn);
+}
+
+
+Datum
+pg_show_replication_origin_status(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ int i;
+#define REPLICATION_ORIGIN_PROGRESS_COLS 4
+
+ /* we we want to return 0 rows if slot is set to zero */
+ replorigin_check_prerequisites(false);
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (tupdesc->natts != REPLICATION_ORIGIN_PROGRESS_COLS)
+ elog(ERROR, "wrong function definition");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* prevent slots from being concurrently dropped */
+ LWLockAcquire(ReplicationOriginLock, LW_SHARED);
+
+ /*
+ * Iterate through all possible replication_states, display if they are
+ * filled. Note that we do not take any locks, so slightly corrupted/out
+ * of date values are a possibility.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationState *state;
+ Datum values[REPLICATION_ORIGIN_PROGRESS_COLS];
+ bool nulls[REPLICATION_ORIGIN_PROGRESS_COLS];
+ char *roname;
+
+ state = &replication_states[i];
+
+ /* unused slot, nothing to display */
+ if (state->roident == InvalidRepOriginId)
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 1, sizeof(nulls));
+
+ values[0] = ObjectIdGetDatum(state->roident);
+ nulls[0] = false;
+
+ /*
+ * We're not preventing the origin to be dropped concurrently, so
+ * silently accept that it might be gone.
+ */
+ if (replorigin_by_oid(state->roident, true,
+ &roname))
+ {
+ values[1] = CStringGetTextDatum(roname);
+ nulls[1] = false;
+ }
+
+ LWLockAcquire(&state->lock, LW_SHARED);
+
+ values[ 2] = LSNGetDatum(state->remote_lsn);
+ nulls[2] = false;
+
+ values[3] = LSNGetDatum(state->local_lsn);
+ nulls[3] = false;
+
+ LWLockRelease(&state->lock);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ tuplestore_donestoring(tupstore);
+
+ LWLockRelease(ReplicationOriginLock);
+
+#undef REPLICATION_ORIGIN_PROGRESS_COLS
+
+ return (Datum) 0;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index dc85583..c9c1d10 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1255,7 +1255,8 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
void
ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time)
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
@@ -1273,6 +1274,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
/* serialize the last bunch of changes if we need start earlier anyway */
if (txn->nentries_mem != txn->nentries)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 16b9808..32ac58f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "replication/slot.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
@@ -132,6 +133,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, CheckpointerShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
+ size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -238,6 +240,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
CheckpointerShmemInit();
AutoVacuumShmemInit();
ReplicationSlotsShmemInit();
+ ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index bd27168..62816a9 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -54,6 +54,7 @@
#include "catalog/pg_shdepend.h"
#include "catalog/pg_shdescription.h"
#include "catalog/pg_shseclabel.h"
+#include "catalog/pg_replication_origin.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_tablespace.h"
#include "catalog/pg_ts_config.h"
@@ -620,6 +621,28 @@ static const struct cachedesc cacheinfo[] = {
},
128
},
+ {ReplicationOriginRelationId, /* REPLORIGIDENT */
+ ReplicationOriginIdentIndex,
+ 1,
+ {
+ Anum_pg_replication_origin_roident,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
+ {ReplicationOriginRelationId, /* REPLORIGNAME */
+ ReplicationOriginNameIndex,
+ 1,
+ {
+ Anum_pg_replication_origin_roname,
+ 0,
+ 0,
+ 0
+ },
+ 16
+ },
{RewriteRelationId, /* RULERELNAME */
RewriteRelRulenameIndexId,
2,
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index a0805d8..4a22575 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -56,6 +56,8 @@
#include "common/restricted_token.h"
#include "storage/large_object.h"
#include "pg_getopt.h"
+#include "replication/logical.h"
+#include "replication/origin.h"
static ControlFileData ControlFile; /* pg_control values */
@@ -1091,6 +1093,7 @@ WriteEmptyXLOG(void)
record->xl_tot_len = SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint);
record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
record->xl_rmid = RM_XLOG_ID;
+
recptr += SizeOfXLogRecord;
*(recptr++) = XLR_BLOCK_ID_DATA_SHORT;
*(recptr++) = sizeof(CheckPoint);
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 93d1217..ad44db3 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -13,6 +13,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
+#include "replication/origin.h"
#include "utils/guc.h"
@@ -21,18 +22,13 @@ extern PGDLLIMPORT bool track_commit_timestamp;
extern bool check_track_commit_timestamp(bool *newval, void **extra,
GucSource source);
-typedef uint32 CommitTsNodeId;
-#define InvalidCommitTsNodeId 0
-
-extern void CommitTsSetDefaultNodeId(CommitTsNodeId nodeid);
-extern CommitTsNodeId CommitTsGetDefaultNodeId(void);
extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
TransactionId *subxids, TimestampTz timestamp,
- CommitTsNodeId nodeid, bool do_xlog);
+ RepOriginId nodeid, bool do_xlog);
extern bool TransactionIdGetCommitTsData(TransactionId xid,
- TimestampTz *ts, CommitTsNodeId *nodeid);
+ TimestampTz *ts, RepOriginId *nodeid);
extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
- CommitTsNodeId *nodeid);
+ RepOriginId *nodeid);
extern Size CommitTsShmemBuffers(void);
extern Size CommitTsShmemSize(void);
@@ -58,7 +54,7 @@ extern void AdvanceOldestCommitTs(TransactionId oldestXact);
typedef struct xl_commit_ts_set
{
TimestampTz timestamp;
- CommitTsNodeId nodeid;
+ RepOriginId nodeid;
TransactionId mainxid;
/* subxact Xids follow */
} xl_commit_ts_set;
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 48f04c6..47033da 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,3 +44,4 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fdf3ea3..9e78403 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -131,6 +131,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_RELFILENODES (1U << 2)
#define XACT_XINFO_HAS_INVALS (1U << 3)
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
+#define XACT_XINFO_HAS_ORIGIN (1U << 5)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -217,6 +218,12 @@ typedef struct xl_xact_twophase
} xl_xact_twophase;
#define MinSizeOfXactInvals offsetof(xl_xact_invals, msgs)
+typedef struct xl_xact_origin
+{
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_origin;
+
typedef struct xl_xact_commit
{
TimestampTz xact_time; /* time of commit */
@@ -227,6 +234,7 @@ typedef struct xl_xact_commit
/* xl_xact_relfilenodes follows if XINFO_HAS_RELFILENODES */
/* xl_xact_invals follows if XINFO_HAS_INVALS */
/* xl_xact_twophase follows if XINFO_HAS_TWOPHASE */
+ /* xl_xact_origin follows if XINFO_HAS_ORIGIN */
} xl_xact_commit;
#define MinSizeOfXactCommit (offsetof(xl_xact_commit, xact_time) + sizeof(TimestampTz))
@@ -267,6 +275,9 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
typedef struct xl_xact_parsed_abort
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2b1f423..f08b676 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -85,6 +85,7 @@ typedef enum
} RecoveryTargetType;
extern XLogRecPtr XactLastRecEnd;
+extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
extern bool reachedConsistency;
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index deca1de..75cf435 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD083 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD085 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 6638c1d..18a3e7c 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -45,6 +45,12 @@ typedef uint64 XLogSegNo;
typedef uint32 TimeLineID;
/*
+ * Replication origin id - this is located in this file to avoid having to
+ * include origin.h in a bunch of xlog related places.
+ */
+typedef uint16 RepOriginId;
+
+/*
* Because O_DIRECT bypasses the kernel buffers, and because we never
* read those buffers except during crash recovery or if wal_level != minimal,
* it is a win to use it in all cases where we sync on each write(). We could
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 6864c95..ac60929 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -39,6 +39,7 @@
/* prototypes for public functions in xloginsert.c: */
extern void XLogBeginInsert(void);
+extern void XLogIncludeOrigin(void);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
extern void XLogRegisterData(char *data, int len);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 609bfe3..5164abe 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -127,6 +127,8 @@ struct XLogReaderState
uint32 main_data_len; /* main data portion's length */
uint32 main_data_bufsz; /* allocated size of the buffer */
+ RepOriginId record_origin;
+
/* information about blocks referenced by the record. */
DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
@@ -186,6 +188,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
#define XLogRecGetData(decoder) ((decoder)->main_data)
#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index b487ae0..7a049f0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -212,5 +212,6 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
+#define XLR_BLOCK_ID_ORIGIN 253
#endif /* XLOGRECORD_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 86d1402..f9cbbfd 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201504171
+#define CATALOG_VERSION_NO 201504123
#endif
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index a680229..fefe313 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -305,6 +305,12 @@ DECLARE_UNIQUE_INDEX(pg_policy_oid_index, 3257, on pg_policy using btree(oid oid
DECLARE_UNIQUE_INDEX(pg_policy_polrelid_polname_index, 3258, on pg_policy using btree(polrelid oid_ops, polname name_ops));
#define PolicyPolrelidPolnameIndexId 3258
+DECLARE_UNIQUE_INDEX(pg_replication_origin_roiident_index, 6001, on pg_replication_origin using btree(roident oid_ops));
+#define ReplicationOriginIdentIndex 6001
+
+DECLARE_UNIQUE_INDEX(pg_replication_origin_roname_index, 6002, on pg_replication_origin using btree(roname varchar_pattern_ops));
+#define ReplicationOriginNameIndex 6002
+
/* last step of initialization script: build the indexes declared above */
BUILD_INDICES
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a3cc91b..7df02aa 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5201,6 +5201,42 @@ DESCR("for use by pg_upgrade");
DATA(insert OID = 3591 ( binary_upgrade_create_empty_extension PGNSP PGUID 12 1 0 0 0 f f f f f f v 7 0 2278 "25 25 16 25 1028 1009 1009" _null_ _null_ _null_ _null_ binary_upgrade_create_empty_extension _null_ _null_ _null_ ));
DESCR("for use by pg_upgrade");
+/* replication/origin.h */
+DATA(insert OID = 6003 ( pg_replication_origin_create PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_origin_create _null_ _null_ _null_ ));
+DESCR("create a replication origin");
+
+DATA(insert OID = 6004 ( pg_replication_origin_drop PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_origin_drop _null_ _null_ _null_ ));
+DESCR("drop replication origin identified by its name");
+
+DATA(insert OID = 6005 ( pg_replication_origin_oid PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "25" _null_ _null_ _null_ _null_ pg_replication_origin_oid _null_ _null_ _null_ ));
+DESCR("translate the replication origin's name to its id");
+
+DATA(insert OID = 6006 ( pg_replication_origin_session_setup PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "25" _null_ _null_ _null_ _null_ pg_replication_origin_session_setup _null_ _null_ _null_ ));
+DESCR("configure session to maintain replication progress tracking for the passed in origin");
+
+DATA(insert OID = 6007 ( pg_replication_origin_session_reset PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_replication_origin_session_reset _null_ _null_ _null_ ));
+DESCR("teardown configured replication progress tracking");
+
+DATA(insert OID = 6008 ( pg_replication_origin_session_is_setup PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_replication_origin_session_is_setup _null_ _null_ _null_ ));
+DESCR("is a replication origin configured in this session");
+
+DATA(insert OID = 6009 ( pg_replication_origin_session_progress PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 3220 "16" _null_ _null_ _null_ _null_ pg_replication_origin_session_progress _null_ _null_ _null_ ));
+DESCR("get the replication progress of the current session");
+
+DATA(insert OID = 6010 ( pg_replication_origin_xact_setup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ pg_replication_origin_xact_setup _null_ _null_ _null_ ));
+DESCR("setup the transaction's origin lsn and timestamp");
+
+DATA(insert OID = 6011 ( pg_replication_origin_xact_reset PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ pg_replication_origin_xact_reset _null_ _null_ _null_ ));
+DESCR("reset the transaction's origin lsn and timestamp");
+
+DATA(insert OID = 6012 ( pg_replication_origin_advance PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2278 "25 3220" _null_ _null_ _null_ _null_ pg_replication_origin_advance _null_ _null_ _null_ ));
+DESCR("advance replication itentifier to specific location");
+
+DATA(insert OID = 6013 ( pg_replication_origin_progress PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 3220 "25 16" _null_ _null_ _null_ _null_ pg_replication_origin_progress _null_ _null_ _null_ ));
+DESCR("get an individual replication origin's replication progress");
+
+DATA(insert OID = 6014 ( pg_show_replication_origin_status PGNSP PGUID 12 1 100 0 0 f f f f f t v 0 0 2249 "" "{26,25,3220,3220}" "{o,o,o,o}" "{local_id, external_id, remote_lsn, local_lsn}" _null_ pg_show_replication_origin_status _null_ _null_ _null_ ));
+DESCR("get progress for all replication origins");
/*
* Symbolic values for provolatile column: these indicate whether the result
diff --git a/src/include/catalog/pg_replication_origin.h b/src/include/catalog/pg_replication_origin.h
new file mode 100644
index 0000000..91ecb75
--- /dev/null
+++ b/src/include/catalog/pg_replication_origin.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_replication_origin.h
+ * Persistent replication origin registry
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/pg_replication_origin.h
+ *
+ * NOTES
+ * the genbki.pl script reads this file and generates .bki
+ * information from the DATA() statements.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_REPLICATION_ORIGIN_H
+#define PG_REPLICATION_ORIGIN_H
+
+#include "catalog/genbki.h"
+#include "access/xlogdefs.h"
+
+/* ----------------
+ * pg_replication_origin. cpp turns this into
+ * typedef struct FormData_pg_replication_origin
+ * ----------------
+ */
+#define ReplicationOriginRelationId 6000
+
+CATALOG(pg_replication_origin,6000) BKI_SHARED_RELATION BKI_WITHOUT_OIDS
+{
+ /*
+ * Locally known id that get included into WAL.
+ *
+ * This should never leave the system.
+ *
+ * Needs to fit into a uint16, so we don't waste too much space in WAL
+ * records. For this reason we don't use a normal Oid column here, since
+ * we need to handle allocation of new values manually.
+ */
+ Oid roident;
+
+ /*
+ * Variable-length fields start here, but we allow direct access to
+ * riname.
+ */
+
+ /* external, free-format, origin */
+ text roname BKI_FORCE_NOT_NULL;
+#ifdef CATALOG_VARLEN /* further variable-length fields */
+#endif
+} FormData_pg_replication_origin;
+
+typedef FormData_pg_replication_origin *Form_pg_replication_origin;
+
+/* ----------------
+ * compiler constants for pg_replication_origin
+ * ----------------
+ */
+#define Natts_pg_replication_origin 2
+#define Anum_pg_replication_origin_roident 1
+#define Anum_pg_replication_origin_roname 2
+
+/* ----------------
+ * pg_replication_origin has no initial contents
+ * ----------------
+ */
+
+#endif /* PG_REPLICATION_ORIGIN_H */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index cce4394..dfdbe65 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -97,4 +97,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+
#endif
diff --git a/src/include/replication/origin.h b/src/include/replication/origin.h
new file mode 100644
index 0000000..b9293ec
--- /dev/null
+++ b/src/include/replication/origin.h
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ * replication_identifier.h
+ * Exports from replication/logical/origin.c
+ *
+ * Copyright (c) 2013-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/origin.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_ORIGIN_H
+#define PG_ORIGIN_H
+
+#include "access/xlogdefs.h"
+#include "catalog/pg_replication_origin.h"
+#include "replication/logical.h"
+
+typedef struct xl_replorigin_set
+{
+ XLogRecPtr remote_lsn;
+ RepOriginId node_id;
+ bool force;
+} xl_replorigin_set;
+
+typedef struct xl_replorigin_drop
+{
+ RepOriginId node_id;
+} xl_replorigin_drop;
+
+#define XLOG_REPLORIGIN_SET 0x00
+#define XLOG_REPLORIGIN_DROP 0x10
+
+#define InvalidRepOriginId 0
+#define DoNotReplicateId UINT16_MAX
+
+extern PGDLLIMPORT RepOriginId replident_sesssion_origin;
+extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn;
+extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp;
+
+/* API for querying & manipulating replication origins */
+extern RepOriginId replorigin_by_name(char *name, bool missing_ok);
+extern RepOriginId replorigin_create(char *name);
+extern void replorigin_drop(RepOriginId roident);
+extern bool replorigin_by_oid(RepOriginId roident, bool missing_ok,
+ char **roname);
+
+/* API for querying & manipulating replication progress tracking */
+extern void replorigin_advance(RepOriginId node,
+ XLogRecPtr remote_commit,
+ XLogRecPtr local_commit,
+ bool go_backward, bool wal_log);
+extern XLogRecPtr replorigin_get_progress(RepOriginId node, bool flush);
+
+extern void replorigin_session_advance(XLogRecPtr remote_commit,
+ XLogRecPtr local_commit);
+extern void replorigin_session_setup(RepOriginId node);
+extern void replorigin_session_reset(void);
+extern XLogRecPtr replorigin_session_get_progress(bool flush);
+
+/* Checkpoint/Startup integration */
+extern void CheckPointReplicationOrigin(void);
+extern void StartupReplicationOrigin(void);
+
+/* WAL logging */
+void replorigin_redo(XLogReaderState *record);
+void replorigin_desc(StringInfo buf, XLogReaderState *record);
+const char * replorigin_identify(uint8 info);
+
+/* shared memory allocation */
+extern Size ReplicationOriginShmemSize(void);
+extern void ReplicationOriginShmemInit(void);
+
+/* SQL callable functions */
+extern Datum pg_replication_origin_create(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_drop(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_oid(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_session_setup(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_session_reset(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_session_is_setup(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_session_progress(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_xact_setup(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_xact_reset(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_advance(PG_FUNCTION_ARGS);
+extern Datum pg_replication_origin_progress(PG_FUNCTION_ARGS);
+extern Datum pg_show_replication_origin_status(PG_FUNCTION_ARGS);
+
+#endif /* PG_ORIGIN_H */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 0935c1b..bec1a56 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -74,6 +74,13 @@ typedef void (*LogicalDecodeCommitCB) (
XLogRecPtr commit_lsn);
/*
+ * Filter changes by origin.
+ */
+typedef bool (*LogicalDecodeFilterByOriginCB) (
+ struct LogicalDecodingContext *,
+ RepOriginId origin_id);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (
@@ -89,6 +96,7 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f1e0f57..6a5528a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -68,6 +68,8 @@ typedef struct ReorderBufferChange
/* The type of change. */
enum ReorderBufferChangeType action;
+ RepOriginId origin_id;
+
/*
* Context data for the change, which part of the union is valid depends
* on action/action_internal.
@@ -166,6 +168,10 @@ typedef struct ReorderBufferTXN
*/
XLogRecPtr restart_decoding_lsn;
+ /* origin of the change that caused this transaction */
+ RepOriginId origin_id;
+ XLogRecPtr origin_lsn;
+
/*
* Commit time, only known when we read the actual commit record.
*/
@@ -339,7 +345,7 @@ void ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
void ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time);
+ TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e3c2efc..cff3b99 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -134,8 +134,9 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
#define ReplicationSlotControlLock (&MainLWLockArray[37].lock)
#define CommitTsControlLock (&MainLWLockArray[38].lock)
#define CommitTsLock (&MainLWLockArray[39].lock)
+#define ReplicationOriginLock (&MainLWLockArray[40].lock)
-#define NUM_INDIVIDUAL_LWLOCKS 40
+#define NUM_INDIVIDUAL_LWLOCKS 41
/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index ba0b090..b875251 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,8 @@ enum SysCacheIdentifier
RANGETYPE,
RELNAMENSP,
RELOID,
+ REPLORIGIDENT,
+ REPLORIGNAME,
RULERELNAME,
STATRELATTINH,
TABLESPACEOID,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 25095e5..f7f016b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1390,6 +1390,11 @@ pg_prepared_xacts| SELECT p.transaction,
FROM ((pg_prepared_xact() p(transaction, gid, prepared, ownerid, dbid)
LEFT JOIN pg_authid u ON ((p.ownerid = u.oid)))
LEFT JOIN pg_database d ON ((p.dbid = d.oid)));
+pg_replication_origin_status| SELECT pg_show_replication_origin_status.local_id,
+ pg_show_replication_origin_status.external_id,
+ pg_show_replication_origin_status.remote_lsn,
+ pg_show_replication_origin_status.local_lsn
+ FROM pg_show_replication_origin_status() pg_show_replication_origin_status(local_id, external_id, remote_lsn, local_lsn);
pg_replication_slots| SELECT l.slot_name,
l.plugin,
l.slot_type,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index c7be273..324bf91 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -121,6 +121,7 @@ pg_pltemplate|t
pg_policy|t
pg_proc|t
pg_range|t
+pg_replication_origin|t
pg_rewrite|t
pg_seclabel|t
pg_shdepend|t
--
2.4.0.rc2.1.g3d6bc9a
On 24/04/15 14:32, Andres Freund wrote:
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
On 04/17/2015 11:54 AM, Andres Freund wrote:
I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for "external" 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called "pg_replication_identifier".
But that's like talking about "index identifiers", instead of just indexes,
and calling the system table pg_index_oid.The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.
There are few oversights in the renaming:
doc/src/sgml/func.sgml:
+ Return the replay position for the passed in replication
+ identifier. The parameter <parameter>flush</parameter>
src/include/replication/origin.h:
+ * replication_identifier.h
----
+extern PGDLLIMPORT RepOriginId replident_sesssion_origin;
+extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn;
+extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp;
(these are used then in multiple places in code afterwards and also
mentioned in comment above replorigin_advance)
src/backend/replication/logical/origin.c:
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer
----
+ default:
+ elog(PANIC, "replident_redo: unknown op code
----
+ * This function may only be called if a origin was setup with
+ * replident_session_setup().
I also think the "replident_checkpoint" file should be renamed to
"replorigin_checkpoint".
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4/24/15 8:32 AM, Andres Freund wrote:
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
On 04/17/2015 11:54 AM, Andres Freund wrote:
I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for "external" 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called "pg_replication_identifier".
But that's like talking about "index identifiers", instead of just indexes,
and calling the system table pg_index_oid.The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.I'm pretty happy with this state.
Shouldn't this be backed up by pg_dump(all?)?
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut <peter_e@gmx.net> wrote:
On 4/24/15 8:32 AM, Andres Freund wrote:
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
On 04/17/2015 11:54 AM, Andres Freund wrote:
I've attached a rebased patch, that adds decision about origin
logging
to the relevant XLogInsert() callsites for "external" 2 byte
identifiers
and removes the pad-reusing version in the interest of moving
forward.
Putting aside the 2 vs. 4 byte identifier issue, let's discuss
naming:
I just realized that it talks about "replication identifier" as the
new
fundamental concept. The system table is called
"pg_replication_identifier".
But that's like talking about "index identifiers", instead of just
indexes,
and calling the system table pg_index_oid.
The important concept this patch actually adds is the *origin* of
each
transaction. That term is already used in some parts of the patch. I
think
we should roughly do a search-replace of "replication identifier" ->
"replication origin" to the patch. Or even "transaction origin".Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.I'm pretty happy with this state.
Shouldn't this be backed up by pg_dump(all?)?
Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense.
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4/24/15 4:29 PM, Andres Freund wrote:
Shouldn't this be backed up by pg_dump(all?)?
Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense.
I agree it might not be the best match. But we should consider that we
used to say, a backup by pg_dumpall plus configuration files is a
complete backup. Now we have replication slots and possibly replication
identifiers and maybe similar features in the future that are not
covered by this backup method.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Apr 28, 2015 at 10:00 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On 4/24/15 4:29 PM, Andres Freund wrote:
Shouldn't this be backed up by pg_dump(all?)?
Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense.
I agree it might not be the best match. But we should consider that we
used to say, a backup by pg_dumpall plus configuration files is a
complete backup. Now we have replication slots and possibly replication
identifiers and maybe similar features in the future that are not
covered by this backup method.
That's true. But if you did backup the replication slots with
pg_dump, and then you restored them as part of restoring the dump,
your replication setup would be just as broken as if you had never
backed up those replication slots at all.
I think the problem here is that replication slots are part of
*cluster* configuration, not individual node configuration. If you
back up a set of nodes that make up a cluster, and then restore them,
you might hope that you will end up with working slots established
between the same pairs of machines that had working slots between them
before. But I don't see a way to make that happen when you look at it
from the point of view of backing up and restoring just one node. As
we get better clustering facilities into core, we may develop more
instances of this problem; the best way of solving it is not clear to
me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 25 April 2015 at 00:32, Andres Freund <andres@anarazel.de> wrote:
Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.
Hi,
It looks like bowerbird is not too happy with this, neither is my build
environment.
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-04-29%2019%3A31%3A06
The attached seems to fix it.
I put the include in the origin.h rather than in the origin.c as the
DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.
Regards
David Rowley
Attachments:
UINT16_MAX_fix.patchapplication/octet-stream; name=UINT16_MAX_fix.patchDownload
diff --git a/src/include/replication/origin.h b/src/include/replication/origin.h
index ca26bc3..5093021 100644
--- a/src/include/replication/origin.h
+++ b/src/include/replication/origin.h
@@ -10,6 +10,8 @@
#ifndef PG_ORIGIN_H
#define PG_ORIGIN_H
+#include <stdint.h>
+
#include "access/xlogdefs.h"
#include "catalog/pg_replication_origin.h"
#include "replication/logical.h"
On 29/04/15 22:12, David Rowley wrote:
On 25 April 2015 at 00:32, Andres Freund <andres@anarazel.de
<mailto:andres@anarazel.de>> wrote:Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.Hi,
It looks like bowerbird is not too happy with this, neither is my build
environment.http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-04-29%2019%3A31%3A06
The attached seems to fix it.
I put the include in the origin.h rather than in the origin.c as the
DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.
I think correct fix is using PG_UINT16_MAX.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers