Logical decoding on standby

Started by Craig Ringer about 9 years ago, 107 messages
#1 Craig Ringer
craig@2ndquadrant.com

Hi all

I've prepared a working initial, somewhat raw implementation for
logical decoding on physical standbys. Since it's a series of 20
smallish patches at the moment I won't attach it. You can find the
current version at time of writing here:

https://github.com/postgres/postgres/compare/c5f365f3ab...2ndQuadrant:dev/logical-decoding-on-standby-pg10-v1?expand=1

i.e. the tag dev/logical-decoding-on-standby-pg10-v1 in the GitHub
repo 2ndQuadrant/postgres.

Whatever I'm working on (subject to rebase, breakage, etc.) lives
in the branch dev/logical-decoding-on-standby-pg10.

Quickstart
===

Compile and install as usual; make sure to install test_decoding
too. To see the functionality in action, configure with
--enable-tap-tests and:

make -C src/test/recovery check

To try manually, initdb a master, set pg_hba.conf to 'trust' on all
replication connections, append to postgresql.conf:

wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
hot_standby_feedback = on

then start the master. Now

psql -d 'master_connstr' -c "SELECT pg_create_physical_replication_slot('standby1');"

and

pg_basebackup -d 'master_connstr' -X stream -R --slot=standby1

and start the replica.

You can now use pg_recvlogical to create a slot on the replica and
decode changes from it, e.g.

pg_recvlogical -d 'replica_connstr' -S test -P test_decoding --create-slot
pg_recvlogical -d 'replica_connstr' -S 'test' -f - --start

and you'll (hopefully) see subsequent changes you make on the master.
If not, tell me.

Patch series contents
===

This patch series incorporates the following changes:

* Timeline following for logical slots, so they can start decoding on
the correct timeline and follow timeline switches (with tests);
originally [3]

* Add --endpos to pg_recvlogical, with tests; originally [3]

* Splitting of xmin and catalog_xmin on hot standby feedback, so
logical slots on a replica only hold down catalog_xmin on the master,
not the xmin for user tables (with tests). Minimises upstream bloat;
originally [4]

* Suppress export of snapshot when starting logical decoding on
replica. Since we cannot allocate an xid, we cannot export snapshots
on standby. Decoding clients can still do their initial setup via a
slot on the master then switch over, do it via physical copy, etc.

* Require hot_standby_feedback to be enabled when starting logical
decoding on a standby.

* Drop any replication slots from a database when redo'ing database
drop, so we don't leave dangling slots on the replica (with tests).

* Make the walsender respect SIGUSR1 and exit via
RecoveryConflictInterrupt() when it gets
PROCSIG_RECOVERY_CONFLICT_DATABASE (with tests); see [6]

* PostgresNode.pm enhancements for the tests

* New test coverage for logical decoding on standby
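
To illustrate the xmin / catalog_xmin split: the walsender patch in this series describes falling back to "the older of the xmin and catalog_xmin" when no physical slot is in use. The following is a minimal, standalone sketch of that choice, assuming 32-bit xids compared modulo 2^32 in the style of the backend's TransactionIdPrecedes(). The helper names xid_is_normal, xid_precedes and older_xmin are hypothetical and do not appear in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Values below 3 are reserved (invalid, bootstrap, frozen). */
#define FirstNormalTransactionId ((TransactionId) 3)

/* Hypothetical helper: true for xids that take part in circular comparison. */
static bool
xid_is_normal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

/* Hypothetical helper: "a is older than b" modulo 2^32, normal xids only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    assert(xid_is_normal(a) && xid_is_normal(b));
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: the single horizon a walsender without a physical
 * slot would have to hold back. An unset value (0) is ignored in favour
 * of the other one.
 */
static TransactionId
older_xmin(TransactionId xmin, TransactionId catalog_xmin)
{
    if (!xid_is_normal(xmin))
        return catalog_xmin;
    if (!xid_is_normal(catalog_xmin))
        return xmin;
    return xid_precedes(catalog_xmin, xmin) ? catalog_xmin : xmin;
}
```

With a physical slot the two values are stored separately, so only catalog tuples are held back for the replica's logical slots; the combined value above is only the fallback that keeps the old behaviour.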

Remaining issues
===

* The method used to make the walsender respect conflict with recovery
interrupts may not be entirely safe, see walsender
procsignal_sigusr1_handler thread [5];

* We probably terminate walsenders running inside an output plugin
with a virtual xact whose xmin is below the upstream's global xmin,
even though its catalog xmin is fine, in
ResolveRecoveryConflictWithSnapshot(...). Haven't been able to test
this. Need to only terminate them when the conflict affects relations
accessible in logical decoding, which likely needs the upstream to
send more info in WAL;

* logical decoding timeline following needs tests for cascading
physical replication where an intermediate node is promoted per
timeline following thread [3];

* walsender may need to maintain ThisTimeLineID in more places per
decoding timeline following v10 thread [3];

* it may be desirable to refactor the walsender to deliver cleaner
logical decoding timeline following per the decoding timeline
following v10 thread [3].

also:

* Nothing stops the user from disabling hot_standby_feedback on the
standby or dropping and re-creating the physical slot on the master,
causing needed catalog tuples to get vacuumed away. Since it's not
going to be safe to check slot shmem state from the
hot_standby_feedback verify hook and we let hot_standby_feedback
change at runtime this is going to be hard to fix comprehensively, so
we need to cope with what happens when feedback fails, but:

* We don't yet detect when upstream's catalog_xmin increases past our
needed catalog_xmin and needed catalog tuples are vacuumed away by the
upstream. So we don't invalidate the slot or terminate any active
decoding sessions using the slot. Active decoding sessions often won't
have a vtxid to use with ResolveRecoveryConflictWithVirtualXIDs(),
transaction cancel is not going to be sufficient, and anyway it'll
cancel too aggressively since it doesn't know it's safe to apply
changes that affect only (non-user-catalog) heap tables without
conflict with decoding sessions.

... so this is definitely NOT ready for commit. It does, however, make
logical decoding work on standby.

Next steps
===

Since it doesn't look practical to ensure there's never been a gap in
hot standby feedback or detect such a gap directly, I'm currently
looking at ways to reliably detect when the upstream has removed
tuples we need and error out. That means we need a way to tell when
upstream's catalog_xmin has advanced, which we don't currently have
from xlogs. Checkpoint's oldestXID is insufficient since advance
could've happened since last checkpoint.

Related threads
===

This series supersedes:

* Timeline following for logical slots
[1]: /messages/by-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com

* WIP: Failover Slots
[2]: /messages/by-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com

and incorporates the patches in:

* Logical decoding timeline following take II
[3]: /messages/by-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com

* Send catalog_xmin separately in hot standby feedback
[4]: /messages/by-id/CAMsr+YFi-LV7S8ehnwUiZnb=1h_14PwQ25d-vyUNq-f5S5r=Zg@mail.gmail.com

* Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() from
walsender?
[5]: /messages/by-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com

Also relevant:

* Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() in walsender
[6]: /messages/by-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#1)
Re: Logical decoding on standby

Hi,

On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:

> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Please attach. Otherwise in a year or two it'll be impossible to look
this up.

Andres


#3 Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#2)
18 attachment(s)
Re: Logical decoding on standby

On 22 November 2016 at 03:14, Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
>
> Please attach. Otherwise in a year or two it'll be impossible to look
> this up.

Fair enough. Attached. Easy to apply with "git am".

I'm currently looking at making detection of replay conflict with a
slot work by separating the current catalog_xmin into two effective
parts - the catalog_xmin currently needed by any known slots
(ProcArray->replication_slot_catalog_xmin, as now), and the oldest
actually valid catalog_xmin where we know we haven't removed anything
yet.

That'll be recorded in a new CheckPoint.oldestCatalogXid field and in
ShmemVariableCache (i.e. VariableCacheData.oldestCatalogXid).

Vacuum will be responsible for advancing
VariableCacheData.oldestCatalogXid by writing an expanded
xl_heap_cleanup_info record with a new latestRemovedCatalogXid field
and then advancing the value in the ShmemVariableCache. Vacuum will
only remove rows of catalog or user-catalog tables that are older than
VariableCacheData.oldestCatalogXid.

This allows recovery on a standby to tell, based on the last
checkpoint + any xl_heap_cleanup_info records used to maintain
ShmemVariableCache, whether the upstream has removed catalog or
user-catalog records it needs. We can signal walsenders with running
xacts to terminate if their xmin passes the threshold, and when they
start a new xact they can check to see if they're still valid and bail
out.
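
As a rough illustration of the check described above, here is a minimal standalone sketch of how a standby might track the threshold and test a slot against it. It assumes circular 32-bit xid comparison, and it assumes that a slot is in conflict once its catalog_xmin is older than the oldest catalog xid still guaranteed to be intact. The function names are hypothetical; none of this is in the attached patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

/* Hypothetical helper: true for xids that take part in circular comparison. */
static bool
xid_is_normal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

/* Hypothetical helper: "a is older than b" modulo 2^32, normal xids only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    assert(xid_is_normal(a) && xid_is_normal(b));
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical: fold a threshold seen in a checkpoint or cleanup record
 * into the standby's copy. The threshold only ever moves forward.
 */
static TransactionId
advance_oldest_catalog_xid(TransactionId current, TransactionId from_wal)
{
    if (!xid_is_normal(from_wal))
        return current;
    if (!xid_is_normal(current) || xid_precedes(current, from_wal))
        return from_wal;
    return current;
}

/*
 * Hypothetical: a slot conflicts once the upstream may have removed
 * catalog tuples at or after the slot's catalog_xmin. An unset value on
 * either side is treated as "no conflict known" (an assumption).
 */
static bool
slot_conflicts_with_removal(TransactionId slot_catalog_xmin,
                            TransactionId oldest_catalog_xid)
{
    if (!xid_is_normal(slot_catalog_xmin) ||
        !xid_is_normal(oldest_catalog_xid))
        return false;
    return xid_precedes(slot_catalog_xmin, oldest_catalog_xid);
}
```

A slot whose catalog_xmin equals the threshold is still safe in this sketch, since only rows older than the threshold are eligible for removal.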

xl_heap_cleanup_info isn't emitted much, but if adding a field there
is a problem we can instead add an additional xlog buffer that's only
appended when wal_level = logical.

Doing things this way avoids:

* the need for the standby to be able to tell at redo time whether a
RelFileNode is for a catalog or user-catalog relation without access
to relcache; or
* the need to add info on whether a catalog or user-catalog is being
affected to any xlog record that can cause a snapshot conflict on
standby; or
* the need for a completely reliable way to ensure hot_standby_feedback
can never cease to affect the master even if the user does something dumb

at the cost of sometimes somewhat delaying removal of catalog or
user-catalog tuples when wal_level >= hot_standby, a new CheckPoint
field, and a new field in xl_heap_cleanup_info.

The above is not incorporated in the attached patch series, see the
prior post for status of the attached patches.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0005-Create-new-pg_lsn-class-to-deal-with-awkward-LSNs-in.patch (text/x-patch; charset=US-ASCII)
From 610c323eac56e86767f1e5e16f5b96a449393d38 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 14 Nov 2016 12:19:35 +0800
Subject: [PATCH 05/21] Create new pg_lsn class to deal with awkward LSNs in
 tests

---
 src/test/perl/Makefile        |   3 +
 src/test/perl/pg_lsn.pm       | 144 ++++++++++++++++++++++++++++++++++++++++++
 src/test/perl/t/001_load.pl   |   9 +++
 src/test/perl/t/002_pg_lsn.pl |  68 ++++++++++++++++++++
 4 files changed, 224 insertions(+)
 create mode 100644 src/test/perl/pg_lsn.pm
 create mode 100644 src/test/perl/t/001_load.pl
 create mode 100644 src/test/perl/t/002_pg_lsn.pl

diff --git a/src/test/perl/Makefile b/src/test/perl/Makefile
index 8ab60fc..cdc38f4 100644
--- a/src/test/perl/Makefile
+++ b/src/test/perl/Makefile
@@ -15,6 +15,9 @@ include $(top_builddir)/src/Makefile.global
 
 ifeq ($(enable_tap_tests),yes)
 
+check:
+	$(prove_check)
+
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(pgxsdir)/$(subdir)'
 
diff --git a/src/test/perl/pg_lsn.pm b/src/test/perl/pg_lsn.pm
new file mode 100644
index 0000000..777b3df
--- /dev/null
+++ b/src/test/perl/pg_lsn.pm
@@ -0,0 +1,144 @@
+package pg_lsn;
+
+use strict;
+use warnings;
+
+our (@ISA, @EXPORT_OK);
+BEGIN {
+	require Exporter;
+	@ISA = qw(Exporter);
+	@EXPORT_OK = qw(parse_lsn);
+}
+
+use Scalar::Util qw(blessed looks_like_number);
+use Carp;
+
+use overload
+	'""' => \&Str,
+	'<=>' => \&NumCmp,
+	'bool' => \&Bool,
+	'-' => \&Negate,
+	fallback => 1;
+
+=pod package pg_lsn
+
+A class to encapsulate a PostgreSQL log-sequence number (LSN) and handle conversion
+of its hex representation.
+
+Provides equality and inequality operators.
+
+Calling 'new' on undef or empty string argument returns undef, not an instance.
+
+=cut
+
+sub new_num
+{
+	my ($class, $high, $low) = @_;
+	my $self = bless { '_low' => $low, '_high' => $high } => $class;
+	$self->_constraint;
+	return $self;
+}
+
+sub new
+{
+	my ($class, $lsn_str) = @_;
+	return undef if !defined($lsn_str) || $lsn_str eq '';
+	my ($high, $low) = split('/', $lsn_str, 2);
+	die "malformed LSN" if (!defined($low) || $high eq '' || $low eq '');
+	return $class->new_num(hex($high), hex($low));
+}
+
+sub NumCmp
+{
+	my ($self, $other, $swap) = @_;
+	$self->_constraint;
+	die "comparison with undef" unless defined($other);
+	if (!blessed($other))
+	{
+		# coerce from string if needed. Try to coerce any non-object.
+		$other = pg_lsn->new($other) if !blessed($other);
+	}
+	$other->_constraint;
+	# and compare
+	my $ret;
+	if ($self->{'_high'} < $other->{'_high'})
+	{
+		$ret = -1;
+	}
+	elsif ($self->{'_high'} == $other->{'_high'})
+	{
+		if ($self->{'_low'} < $other->{'_low'})
+		{
+			$ret = -1;
+		}
+		elsif ($self->{'_low'} == $other->{'_low'})
+		{
+			$ret = 0;
+		}
+		else
+		{
+			$ret = 1;
+		}
+	}
+	else
+	{
+		$ret = 1;
+	}
+	$ret = -$ret if $swap;
+	return $ret;
+}
+
+sub _constraint
+{
+	my $self = shift;
+	die "high word must be defined" unless (defined($self->{'_high'}));
+	die "high word must be numeric" unless (looks_like_number($self->{'_high'}));
+	die "high word must be in uint32 range" unless ($self->{'_high'} >= 0 && $self->{'_high'} <= 0xFFFFFFFF);
+	die "low word must be defined" unless (defined($self->{'_low'}));
+	die "low word must be numeric" unless (looks_like_number($self->{'_low'}));
+	die "low word must be in uint32 range" unless ($self->{'_low'} >= 0 && $self->{'_low'} <= 0xFFFFFFFF);
+}
+
+sub Bool
+{
+	my $self = shift;
+	$self->_constraint;
+	return $self->{'_high'} || $self->{'_low'};
+}
+
+sub Negate
+{
+	die "cannot negate pg_lsn";
+}
+
+sub Str
+{
+	my $self = shift;
+	return sprintf("%X/%X", $self->high, $self->low);
+}
+
+sub high
+{
+	my $self = shift;
+	return $self->{'_high'};
+}
+
+sub low
+{
+	my $self = shift;
+	return $self->{'_low'};
+}
+
+# Todo: addition/subtraction. Needs to handle wraparound and carrying.
+
+=pod parse_lsn(lsn)
+
+Returns a pg_lsn object for the passed LSN string,
+or undef if argument is the empty string or undef.
+
+=cut 
+
+sub parse_lsn
+{
+	return pg_lsn->new($_[0]);
+}
diff --git a/src/test/perl/t/001_load.pl b/src/test/perl/t/001_load.pl
new file mode 100644
index 0000000..53a39af
--- /dev/null
+++ b/src/test/perl/t/001_load.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+use Test::More tests => 5;
+
+require_ok 'RecursiveCopy';
+require_ok 'SimpleTee';
+require_ok 'TestLib';
+require_ok 'PostgresNode';
+require_ok 'pg_lsn';
diff --git a/src/test/perl/t/002_pg_lsn.pl b/src/test/perl/t/002_pg_lsn.pl
new file mode 100644
index 0000000..73e3d65
--- /dev/null
+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);
+
+use pg_lsn qw(parse_lsn);
+
+ok(!defined(parse_lsn('')), 'parse_lsn of empty string is undef');
+ok(!defined(parse_lsn(undef)), 'parse_lsn of undef is undef');
+
+my $zero_lsn = parse_lsn('0/0');
+ok(blessed($zero_lsn), 'zero lsn blessed');
+ok($zero_lsn->isa("pg_lsn"), 'zero lsn isa pg_lsn');
+is($zero_lsn->{'_high'}, 0, 'zero lsn high word zero');
+is($zero_lsn->{'_low'}, 0, 'zero lsn low word zero');
+cmp_ok($zero_lsn, "==", pg_lsn->new_num(0, 0), 'parse_lsn of 0/0');
+
+cmp_ok(parse_lsn('0/FFFFFFFF'), "==", pg_lsn->new_num(0, 0xFFFFFFFF), 'parse_lsn of 0/FFFFFFFF');
+cmp_ok(parse_lsn('FFFFFFFF/0'), "==", pg_lsn->new_num(0xFFFFFFFF, 0), 'parse_lsn of FFFFFFFF/0');
+cmp_ok(parse_lsn('FFFFFFFF/FFFFFFFF'), "==", pg_lsn->new_num(0xFFFFFFFF, 0xFFFFFFFF), 'parse_lsn of 0xFFFFFFFF/0xFFFFFFFF');
+
+is(parse_lsn('2/2') <=> parse_lsn('2/3'), -1);
+is(parse_lsn('2/2') <=> parse_lsn('2/2'), 0);
+is(parse_lsn('2/2') <=> parse_lsn('2/1'), 1);
+is(parse_lsn('2/2') <=> parse_lsn('3/2'), -1);
+is(parse_lsn('2/2') <=> parse_lsn('1/2'), 1);
+
+cmp_ok(parse_lsn('0/1'), "==", parse_lsn('0/1'));
+ok(!(parse_lsn('0/1') == parse_lsn('0/2')), "! 0/1 == 0/2");
+ok(!(parse_lsn('0/1') == parse_lsn('0/0')), "! 0/1 == 0/0");
+cmp_ok(parse_lsn('1/0'), "==", parse_lsn('1/0'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('1/1'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('2/0'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('0/0'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('0/1'));
+
+cmp_ok(parse_lsn('0/1'), ">=", parse_lsn('0/1'));
+cmp_ok(parse_lsn('0/1'), "<=", parse_lsn('0/1'));
+cmp_ok(parse_lsn('0/1'), "<=", parse_lsn('0/2'));
+cmp_ok(parse_lsn('0/1'), ">=", parse_lsn('0/0'));
+cmp_ok(parse_lsn('1/0'), ">=", parse_lsn('1/0'));
+cmp_ok(parse_lsn('1/0'), "<=", parse_lsn('1/0'));
+cmp_ok(parse_lsn('1/0'), "<=", parse_lsn('2/0'));
+cmp_ok(parse_lsn('1/0'), ">=", parse_lsn('0/0'));
+cmp_ok(parse_lsn('1/1'), ">=", parse_lsn('1/1'));
+cmp_ok(parse_lsn('1/1'), "<=", parse_lsn('1/1'));
+cmp_ok(parse_lsn('1/1'), "<=", parse_lsn('1/2'));
+cmp_ok(parse_lsn('1/2'), ">=", parse_lsn('1/1'));
+
+ok(parse_lsn('1/1'), 'bool conversion');
+ok(! $zero_lsn, 'bool negation');
+
+# implicit string conversions
+cmp_ok(parse_lsn('0/0'), "==", "0/0");
+cmp_ok(parse_lsn('FFFFFFFF/FFFFFFFF'), "==", "FFFFFFFF/FFFFFFFF");
+# swapped string conversions
+cmp_ok("0/0", "==", parse_lsn('0/0'));
+cmp_ok("FFFFFFFF/FFFFFFFF", "==", parse_lsn('FFFFFFFF/FFFFFFFF'));
+
+# negation makes no sense for a uint64
+eval {
+	- parse_lsn('0/1');
+};
+if ($@) {
+	pass('negation raises error');
+} else {
+	fail('negation did not raise error');
+}
-- 
2.5.5
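
For comparison with the Perl class in the patch above, here is a minimal C sketch of the same idea: parse the "high/low" hex form of an LSN into one 64-bit value, so that ordinary integer comparison gives the ordering the tests exercise. The names parse_lsn and lsn_format are hypothetical here and are not the backend's pg_lsn functions. The sketch assumes a 32-bit unsigned int and does not reject words longer than eight hex digits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "HIGH/LOW" (both hexadecimal) into a 64-bit
 * LSN. Returns false for NULL, empty, or malformed input, including
 * trailing garbage after the low word.
 */
static bool
parse_lsn(const char *str, uint64_t *lsn)
{
    unsigned int high;
    unsigned int low;
    int consumed = 0;

    if (str == NULL)
        return false;
    /* %n records how many characters were consumed; it is not counted. */
    if (sscanf(str, "%X/%X%n", &high, &low, &consumed) != 2)
        return false;
    if (str[consumed] != '\0')
        return false;

    *lsn = ((uint64_t) high << 32) | low;
    return true;
}

/* Hypothetical helper: format back to the "%X/%X" text form. */
static int
lsn_format(uint64_t lsn, char *buf, size_t len)
{
    return snprintf(buf, len, "%X/%X",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) lsn);
}
```

Once both words live in a single uint64_t, the high word dominates the comparison automatically, which is what the nested high/low branches in NumCmp spell out by hand.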

0006-Expand-streaming-replication-tests-to-cover-hot-stan.patch (text/x-patch; charset=US-ASCII)
From 63773b6f97148ffb2cc741c4259742b5712ea353 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 9 Nov 2016 13:44:04 +0800
Subject: [PATCH 06/21] Expand streaming replication tests to cover hot standby
 feedback and physical replication slots

---
 src/test/recovery/t/001_stream_rep.pl | 105 +++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 5ce69bb..ef29892 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 4;
+use Test::More tests => 22;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -58,3 +58,106 @@ is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
 is($node_standby_2->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 2');
+
+diag "switching to physical replication slot";
+# Switch to using a physical replication slot. We can do this without a new
+# backup since physical slots can go backwards if needed. Do so on both
+# standbys. Since we're going to be testing things that affect the slot state,
+# also reduce wal_receiver_status_interval to ensure timely updates.
+my ($slotname_1, $slotname_2) = ('standby_1', 'standby_2');
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_master->restart;
+is($node_master->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_1');]), 0, 'physical slot created on master');
+$node_standby_1->append_conf('recovery.conf', "primary_slot_name = $slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_standby_1->restart;
+is($node_standby_1->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_2');]), 0, 'physical slot created on intermediate replica');
+$node_standby_2->append_conf('recovery.conf', "primary_slot_name = $slotname_2\n");
+$node_standby_2->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_2->restart;
+
+sub get_slot_xmins
+{
+	my ($node, $slotname) = @_;
+	my $slotinfo = $node->slot($slotname);
+	return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+# There's no hot standby feedback and there are no logical slots on either peer
+# so xmin and catalog_xmin should be null on both slots.
+my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin null with no hs_feedback');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin null with no hs_feedback');
+
+# Replication still works?
+$node_master->safe_psql('postgres', 'CREATE TABLE replayed(val integer);');
+
+sub replay_check
+{
+	my $newval = $node_master->safe_psql('postgres', 'INSERT INTO replayed(val) SELECT coalesce(max(val),0) + 1 AS newval FROM replayed RETURNING val');
+	$node_master->wait_for_catchup($node_standby_1);
+	$node_standby_1->wait_for_catchup($node_standby_2);
+	$node_standby_1->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
+		or die "standby_1 didn't replay master value $newval";
+	$node_standby_2->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
+		or die "standby_2 didn't replay standby_1 value $newval";
+}
+
+replay_check();
+
+diag "enabling hot_standby_feedback";
+# Enable hs_feedback. The slot should gain an xmin. We set the status interval
+# so we'll see the results promptly.
+$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_1->reload;
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_2->reload;
+replay_check();
+sleep(2);
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+isnt($xmin, '', 'non-cascaded slot xmin non-null with hs feedback');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin still null with hs_feedback');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+isnt($xmin, '', 'cascaded slot xmin non-null with hs feedback');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin still null with hs_feedback');
+
+diag "doing some work to advance xmin";
+for my $i (10000..11000) {
+	$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES ($i);]);
+}
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my ($xmin2, $catalog_xmin2) = get_slot_xmins($node_master, $slotname_1);
+diag "new xmin $xmin2, old xmin $xmin";
+isnt($xmin2, $xmin, 'non-cascaded slot xmin with hs feedback has changed');
+is($catalog_xmin2, '', 'non-cascaded slot catalog_xmin still null with hs_feedback unchanged');
+
+($xmin2, $catalog_xmin2) = get_slot_xmins($node_standby_1, $slotname_2);
+diag "new xmin $xmin2, old xmin $xmin";
+isnt($xmin2, $xmin, 'cascaded slot xmin with hs feedback has changed');
+is($catalog_xmin2, '', 'cascaded slot catalog_xmin still null with hs_feedback unchanged');
+
+diag "disabling hot_standby_feedback";
+# Disable hs_feedback. Xmin should be cleared.
+$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_1->reload;
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_2->reload;
+replay_check();
+sleep(2);
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+is($xmin, '', 'non-cascaded slot xmin null with hs feedback reset');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin still null with hs_feedback reset');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin still null with hs_feedback reset');
-- 
2.5.5

0007-Send-catalog_xmin-in-hot-standby-feedback-protocol.patch (text/x-patch; charset=US-ASCII)
From 38eab580fb651bf01c0d8e06d534499be270b44c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 07/21] Send catalog_xmin in hot standby feedback protocol

Add catalog_xmin to the hot standby feedback protocol so a read replica
that has logical slots can use its physical slot to the master to hold down the
master's catalog_xmin. This information will let a replica prevent vacuuming of
catalog tuples still required by the replica's logical slots.

This is the hot standby feedback protocol change, the new value is always set
to zero by the walreceiver and is ignored by the walsender.
---
 doc/src/sgml/protocol.sgml            | 33 ++++++++++++++++++++++++++++-----
 src/backend/replication/walreceiver.c | 21 ++++++++++++++-------
 src/backend/replication/walsender.c   | 10 ++++++++--
 3 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 50cf527..e0fd9aa 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1783,10 +1783,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1796,7 +1797,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2bb3dce..06ca9e4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1164,8 +1164,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	static bool master_has_standby_xmin = false;
 
@@ -1210,23 +1210,30 @@ XLogWalRcvSendHSFeedback(bool immed)
 	else
 		xmin = InvalidTransactionId;
 
+	catalog_xmin = InvalidTransactionId;
+
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch --;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch --;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(reply_message.data, reply_message.len);
 	if (TransactionIdIsValid(xmin))
 		master_has_standby_xmin = true;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ef8ba80..cd749cd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1638,6 +1638,8 @@ ProcessStandbyHSFeedbackMessage(void)
 	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
@@ -1646,10 +1648,14 @@ ProcessStandbyHSFeedbackMessage(void)
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
 	/* Unset WalSender's xmin if the feedback message value is invalid */
 	if (!TransactionIdIsNormal(feedbackXmin))
-- 
2.5.5
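
The extended feedback message built in the patch above can be pictured as a fixed 25-byte layout: the 'h' type byte, an 8-byte send timestamp, then xmin, its epoch, catalog_xmin and its epoch as four 4-byte integers in network byte order. The following standalone sketch encodes and decodes that layout. The struct and function names are hypothetical; the backend itself uses pq_sendint() and pq_getmsgint() on a StringInfo, and its walsender has already consumed the type byte before decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HS_FEEDBACK_LEN 25      /* 1 type byte + 8 + 4 * 4 */

/* Hypothetical in-memory form of one feedback message. */
typedef struct HSFeedback
{
    int64_t  send_time;
    uint32_t xmin;
    uint32_t xmin_epoch;
    uint32_t catalog_xmin;
    uint32_t catalog_xmin_epoch;
} HSFeedback;

/* Write the low nbytes of val in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *buf, uint64_t val, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        buf[i] = (uint8_t) (val >> (8 * (nbytes - 1 - i)));
    return nbytes;
}

/* Read nbytes of big-endian data. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
    uint64_t val = 0;

    for (size_t i = 0; i < nbytes; i++)
        val = (val << 8) | buf[i];
    return val;
}

/* Encode into buf, which must hold HS_FEEDBACK_LEN bytes. */
static size_t
hs_feedback_encode(const HSFeedback *fb, uint8_t *buf)
{
    size_t off = 0;

    buf[off++] = 'h';
    off += put_be(buf + off, (uint64_t) fb->send_time, 8);
    off += put_be(buf + off, fb->xmin, 4);
    off += put_be(buf + off, fb->xmin_epoch, 4);
    off += put_be(buf + off, fb->catalog_xmin, 4);
    off += put_be(buf + off, fb->catalog_xmin_epoch, 4);
    return off;
}

/* Decode a whole message; returns 0 on success, -1 on a bad length or type. */
static int
hs_feedback_decode(const uint8_t *buf, size_t len, HSFeedback *fb)
{
    if (len != HS_FEEDBACK_LEN || buf[0] != 'h')
        return -1;
    fb->send_time = (int64_t) get_be(buf + 1, 8);
    fb->xmin = (uint32_t) get_be(buf + 9, 4);
    fb->xmin_epoch = (uint32_t) get_be(buf + 13, 4);
    fb->catalog_xmin = (uint32_t) get_be(buf + 17, 4);
    fb->catalog_xmin_epoch = (uint32_t) get_be(buf + 21, 4);
    return 0;
}
```

A message in the old 17-byte format fails the length check in this sketch, which shows why sender and receiver have to agree on the protocol version.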

0008-Make-walsender-respect-catalog_xmin-in-hot-standby-f.patch (text/x-patch; charset=US-ASCII)
From 46b046b2c794cce45bec007f8af69635380d48ce Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:38:40 +0800
Subject: [PATCH 08/21] Make walsender respect catalog_xmin in hot standby
 feedback messages

The walsender now respects the new catalog_xmin field in the hot standby
feedback message. It uses it to set the catalog_xmin field on its physical
replication slot if one is in use. Otherwise it sets its process xmin to the
older of the xmin and catalog_xmin, so the outcome is the same as before
the protocol change.

In the process, factor out walsender's sanity check for xid+epoch wraparound
into a separate TransactionIdInRecentPast() function since we're now checking
it in two places.
---
 src/backend/replication/walsender.c | 111 +++++++++++++++++++++++++++---------
 1 file changed, 84 insertions(+), 27 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cd749cd..ac8c2c3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -216,6 +216,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1536,6 +1537,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1598,7 +1604,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1619,6 +1625,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1629,13 +1651,46 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
 	TransactionId feedbackCatalogXmin;
@@ -1643,7 +1698,8 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
@@ -1657,37 +1713,30 @@ ProcessStandbyHSFeedbackMessage(void)
 		 feedbackCatalogXmin,
 		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1712,15 +1761,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
-- 
2.5.5

0009-Allow-GetOldestXmin-.-to-optionally-disregard-the-ca.patch (text/x-patch; charset=US-ASCII)
From 5c7b117aa3572f1c8566f033b99d242baf1e7190 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 16:13:35 +0800
Subject: [PATCH 09/21] Allow GetOldestXmin(...) to optionally disregard the
 catalog_xmin

Add a new catalog_xmin out-parameter to GetOldestXmin(...), for use when
calculating hot standby feedback xmins. When passed, any needed catalog_xmin is
returned separately instead of being merged with the return value. Adjust
existing call sites.
---
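
A minimal sketch of the out-parameter contract described above, assuming the proc-array scan has already produced a normal xid in result. The function name oldest_xmin_sketch and the rel_needs_catalog_xmin flag are hypothetical; the flag stands in for the "rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define INVALID_XID ((TransactionId) 0)
#define FIRST_NORMAL_XID ((TransactionId) 3)

/* Modulo-2^32 comparison; only meaningful for two normal xids. */
static bool
normal_xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * result: xmin computed from running backends and slot xmins.
 * slot_catalog_xmin: oldest catalog_xmin held by any replication slot.
 * catalog_xmin: if non-NULL, receives the catalog xmin separately and the
 * return value is left unmerged; if NULL, the two are merged as before.
 */
static TransactionId
oldest_xmin_sketch(TransactionId result,
				   TransactionId slot_catalog_xmin,
				   bool rel_needs_catalog_xmin,
				   TransactionId *catalog_xmin)
{
	assert(result >= FIRST_NORMAL_XID);

	/* Non-catalog relations never need to honour catalog_xmin. */
	if (!rel_needs_catalog_xmin)
		slot_catalog_xmin = INVALID_XID;

	if (catalog_xmin != NULL)
	{
		/* Caller accounts for catalog_xmin itself. */
		*catalog_xmin = slot_catalog_xmin;
	}
	else if (slot_catalog_xmin != INVALID_XID &&
			 normal_xid_precedes(slot_catalog_xmin, result))
	{
		/* Old behaviour: back up far enough for logical decoding. */
		result = slot_catalog_xmin;
	}

	return result;
}
```

Existing callers that pass NULL keep the merged value they got before; only a caller that supplies a pointer sees the two horizons separately.
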
 contrib/pg_visibility/pg_visibility.c |  4 +--
 contrib/pgstattuple/pgstatapprox.c    |  2 +-
 src/backend/access/transam/xlog.c     |  4 +--
 src/backend/catalog/index.c           |  2 +-
 src/backend/commands/analyze.c        |  2 +-
 src/backend/commands/vacuum.c         |  4 +--
 src/backend/replication/walreceiver.c |  2 +-
 src/backend/storage/ipc/procarray.c   | 51 +++++++++++++++++++++++------------
 src/include/storage/procarray.h       |  2 +-
 9 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9985e3e..4fa3ad4 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, NULL);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, NULL);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index f524fc4..5b33c97 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, NULL);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..56c672c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8653,7 +8653,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, NULL));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9016,7 +9016,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, NULL));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 08b646d..b673c06 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2272,7 +2272,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, NULL);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c617abb..9b0cc3a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -992,7 +992,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, NULL);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 58bbf55..aaee9a6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -497,7 +497,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, NULL), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -909,7 +909,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, NULL);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 06ca9e4..80cc482 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1206,7 +1206,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+		xmin = GetOldestXmin(NULL, false, NULL);
 	else
 		xmin = InvalidTransactionId;
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5d487d..a4e3549 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1298,17 +1298,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, TransactionId *catalog_xmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1433,17 +1438,29 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
-	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
-	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	if (!(rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+		replication_slot_catalog_xmin = InvalidTransactionId;
+
+	if (catalog_xmin != NULL)
+	{
+		/*
+		 * The caller wants any logical decoding specific xmin reported
+		 * separately, so don't merge it with the xmin we'll return.
+		 */
+		*catalog_xmin = replication_slot_catalog_xmin;
+	}
+	else
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * possible. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
 
 	return result;
 }
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index dd37c0c..f7d1d96 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, TransactionId *catalog_xmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
-- 
2.5.5

0001-Add-an-optional-endpos-LSN-argument-to-pg_recvlogica.patch (text/x-patch; charset=UTF-8)
From 78743234bbb44dd43c3f1d2bdf727a84895bc29b Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 12:37:40 +0800
Subject: [PATCH 01/21] Add an optional --endpos LSN argument to pg_recvlogical
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

pg_recvlogical usually just runs until cancelled or until the upstream
server disconnects. For some purposes, especially testing, it's useful
to be able to stop receiving at a specified LSN without having
to parse the output and deal with buffering issues, etc.

Add an --endpos parameter that takes the LSN at which no further
messages should be written and receiving should stop.

Craig Ringer, Álvaro Herrera
---
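
A minimal sketch of the --endpos stop rules, assuming LSNs are plain 64-bit integers parsed from the usual "hi/lo" hex form. The names parse_lsn, EndposAction, endpos_action_for_record and endpos_reached_by_keepalive are illustrative only and do not appear in pg_recvlogical.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define INVALID_LSN ((XLogRecPtr) 0)

typedef enum
{
	ENDPOS_CONTINUE,			/* write the record and keep streaming */
	ENDPOS_STOP_BEFORE_WRITE,	/* record is past endpos: don't write it */
	ENDPOS_STOP_AFTER_WRITE		/* record is exactly endpos: write, then stop */
} EndposAction;

/* Parse "XXXXXXXX/XXXXXXXX" into a 64-bit LSN; false on malformed input. */
static bool
parse_lsn(const char *str, XLogRecPtr *out)
{
	unsigned int hi;
	unsigned int lo;

	assert(str != NULL && out != NULL);

	if (sscanf(str, "%X/%X", &hi, &lo) != 2)
		return false;
	*out = ((uint64_t) hi) << 32 | lo;
	return true;
}

/* Decide what to do with an XLogData message that starts at record_lsn. */
static EndposAction
endpos_action_for_record(XLogRecPtr endpos, XLogRecPtr record_lsn)
{
	if (endpos == INVALID_LSN)
		return ENDPOS_CONTINUE;	/* --endpos not given */
	if (record_lsn > endpos)
		return ENDPOS_STOP_BEFORE_WRITE;
	if (record_lsn == endpos)
		return ENDPOS_STOP_AFTER_WRITE;
	return ENDPOS_CONTINUE;
}

/* A keepalive whose walEnd has reached endpos means nothing more is coming. */
static bool
endpos_reached_by_keepalive(XLogRecPtr endpos, XLogRecPtr wal_end)
{
	return endpos != INVALID_LSN && wal_end >= endpos;
}
```

In both stop cases the patch flushes output and sends feedback before ending the copy stream, so the slot's confirmed position reflects what was actually written.
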
 doc/src/sgml/ref/pg_recvlogical.sgml   |  34 ++++++++
 src/bin/pg_basebackup/pg_recvlogical.c | 145 +++++++++++++++++++++++++++++----
 2 files changed, 164 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index b35881f..d066ce8 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -38,6 +38,14 @@ PostgreSQL documentation
    constraints as <xref linkend="app-pgreceivexlog">, plus those for logical
    replication (see <xref linkend="logicaldecoding">).
   </para>
+
+  <para>
+   <command>pg_recvlogical</> has no equivalent to the logical decoding
+   SQL interface's peek and get modes. It sends replay confirmations for
+   data lazily as it receives it and on clean exit. To examine pending data on
+   a slot without consuming it, use
+   <link linkend="functions-replication"><function>pg_logical_slot_peek_changes</></>.
+  </para>
  </refsect1>
 
  <refsect1>
@@ -155,6 +163,32 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-E <replaceable>lsn</replaceable></option></term>
+      <term><option>--endpos=<replaceable>lsn</replaceable></option></term>
+      <listitem>
+       <para>
+        In <option>--start</option> mode, automatically stop replication
+        and exit with normal exit status 0 when receiving reaches the
+        specified LSN.  If specified when not in <option>--start</option>
+        mode, an error is raised.
+       </para>
+
+       <para>
+        If there's a record with LSN exactly equal to <replaceable>lsn</>,
+        the record will be output.
+       </para>
+
+       <para>
+        The <option>--endpos</option> option is not aware of transaction
+        boundaries and may truncate output partway through a transaction.
+        Any partially output transaction will not be consumed and will be
+        replayed again when the slot is next read from. Individual messages
+        are never truncated.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>--if-not-exists</option></term>
       <listitem>
        <para>
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..c700edf 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -40,6 +40,7 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
 static XLogRecPtr startpos = InvalidXLogRecPtr;
+static XLogRecPtr endpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
 static bool slot_exists_ok = false;
 static bool do_start_slot = false;
@@ -63,6 +64,9 @@ static XLogRecPtr output_fsync_lsn = InvalidXLogRecPtr;
 static void usage(void);
 static void StreamLogicalLog(void);
 static void disconnect_and_exit(int code);
+static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
+static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
+				   bool keepalive, XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -81,6 +85,7 @@ usage(void)
 			 "                         time between fsyncs to the output file (default: %d)\n"), (fsync_interval / 1000));
 	printf(_("      --if-not-exists    do not error if slot already exists when creating a slot\n"));
 	printf(_("  -I, --startpos=LSN     where in an existing slot should the streaming start\n"));
+	printf(_("  -E, --endpos=LSN       exit after receiving the specified LSN\n"));
 	printf(_("  -n, --no-loop          do not loop on connection lost\n"));
 	printf(_("  -o, --option=NAME[=VALUE]\n"
 			 "                         pass option NAME with optional value VALUE to the\n"
@@ -281,6 +286,7 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		int64		now;
 		int			hdr_len;
+		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -454,6 +460,7 @@ StreamLogicalLog(void)
 			int			pos;
 			bool		replyRequested;
 			XLogRecPtr	walEnd;
+			bool		endposReached = false;
 
 			/*
 			 * Parse the keepalive message, enclosed in the CopyData message.
@@ -476,18 +483,32 @@ StreamLogicalLog(void)
 			}
 			replyRequested = copybuf[pos];
 
-			/* If the server requested an immediate reply, send one. */
-			if (replyRequested)
+			if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
 			{
-				/* fsync data, so we send a recent flush pointer */
-				if (!OutputFsync(now))
-					goto error;
+				/*
+				 * If there's nothing to read on the socket until a keepalive
+				 * we know that the server has nothing to send us; and if
+				 * walEnd has passed endpos, we know nothing else can have
+				 * committed before endpos.  So we can bail out now.
+				 */
+				endposReached = true;
+			}
 
-				now = feGetCurrentTimestamp();
-				if (!sendFeedback(conn, now, true, false))
+			/* Send a reply, if necessary */
+			if (replyRequested || endposReached)
+			{
+				if (!flushAndSendFeedback(conn, &now))
 					goto error;
 				last_status = now;
 			}
+
+			if (endposReached)
+			{
+				prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
+				time_to_abort = true;
+				break;
+			}
+
 			continue;
 		}
 		else if (copybuf[0] != 'w')
@@ -497,7 +518,6 @@ StreamLogicalLog(void)
 			goto error;
 		}
 
-
 		/*
 		 * Read the header of the XLogData message, enclosed in the CopyData
 		 * message. We only need the WAL location field (dataStart), the rest
@@ -515,12 +535,23 @@ StreamLogicalLog(void)
 		}
 
 		/* Extract WAL location for this block */
-		{
-			XLogRecPtr	temp = fe_recvint64(&copybuf[1]);
+		cur_record_lsn = fe_recvint64(&copybuf[1]);
 
-			output_written_lsn = Max(temp, output_written_lsn);
+		if (endpos != InvalidXLogRecPtr && cur_record_lsn > endpos)
+		{
+			/*
+			 * We've read past our endpoint, so prepare to go away being
+			 * cautious about what happens to our output data.
+			 */
+			if (!flushAndSendFeedback(conn, &now))
+				goto error;
+			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			time_to_abort = true;
+			break;
 		}
 
+		output_written_lsn = Max(cur_record_lsn, output_written_lsn);
+
 		bytes_left = r - hdr_len;
 		bytes_written = 0;
 
@@ -557,10 +588,29 @@ StreamLogicalLog(void)
 					strerror(errno));
 			goto error;
 		}
+
+		if (endpos != InvalidXLogRecPtr && cur_record_lsn == endpos)
+		{
+			/* endpos was exactly the record we just processed, we're done */
+			if (!flushAndSendFeedback(conn, &now))
+				goto error;
+			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			time_to_abort = true;
+			break;
+		}
 	}
 
 	res = PQgetResult(conn);
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	if (PQresultStatus(res) == PGRES_COPY_OUT)
+	{
+		/*
+		 * We're doing a client-initiated clean exit and have sent CopyDone to
+		 * the server. We've already sent replay confirmation and fsync'd so
+		 * we can just clean up the connection now.
+		 */
+		goto error;
+	}
+	else if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		fprintf(stderr,
 				_("%s: unexpected termination of replication stream: %s"),
@@ -638,6 +688,7 @@ main(int argc, char **argv)
 		{"password", no_argument, NULL, 'W'},
 /* replication options */
 		{"startpos", required_argument, NULL, 'I'},
+		{"endpos", required_argument, NULL, 'E'},
 		{"option", required_argument, NULL, 'o'},
 		{"plugin", required_argument, NULL, 'P'},
 		{"status-interval", required_argument, NULL, 's'},
@@ -673,7 +724,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:E:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -733,6 +784,16 @@ main(int argc, char **argv)
 				}
 				startpos = ((uint64) hi) << 32 | lo;
 				break;
+			case 'E':
+				if (sscanf(optarg, "%X/%X", &hi, &lo) != 2)
+				{
+					fprintf(stderr,
+							_("%s: could not parse end position \"%s\"\n"),
+							progname, optarg);
+					exit(1);
+				}
+				endpos = ((uint64) hi) << 32 | lo;
+				break;
 			case 'o':
 				{
 					char	   *data = pg_strdup(optarg);
@@ -857,6 +918,16 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (endpos != InvalidXLogRecPtr && !do_start_slot)
+	{
+		fprintf(stderr,
+				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+				progname);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -923,8 +994,8 @@ main(int argc, char **argv)
 		if (time_to_abort)
 		{
 			/*
-			 * We've been Ctrl-C'ed. That's not an error, so exit without an
-			 * errorcode.
+			 * We've been Ctrl-C'ed or reached an exit limit condition. That's
+			 * not an error, so exit without an errorcode.
 			 */
 			disconnect_and_exit(0);
 		}
@@ -943,3 +1014,47 @@ main(int argc, char **argv)
 		}
 	}
 }
+
+/*
+ * Fsync our output data, and send a feedback message to the server.  Returns
+ * true if successful, false otherwise.
+ *
+ * If successful, *now is updated to the current timestamp just before sending
+ * feedback.
+ */
+static bool
+flushAndSendFeedback(PGconn *conn, TimestampTz *now)
+{
+	/* flush data to disk, so that we send a recent flush pointer */
+	if (!OutputFsync(*now))
+		return false;
+	*now = feGetCurrentTimestamp();
+	if (!sendFeedback(conn, *now, true, false))
+		return false;
+
+	return true;
+}
+
+/*
+ * Try to inform the server about our upcoming demise, but don't wait around or
+ * retry on failure.
+ */
+static void
+prepareToTerminate(PGconn *conn, XLogRecPtr endpos, bool keepalive, XLogRecPtr lsn)
+{
+	(void) PQputCopyEnd(conn, NULL);
+	(void) PQflush(conn);
+
+	if (verbose)
+	{
+		if (keepalive)
+			fprintf(stderr, "%s: endpos %X/%X reached by keepalive\n",
+					progname,
+					(uint32) (endpos >> 32), (uint32) endpos);
+		else
+			fprintf(stderr, "%s: endpos %X/%X reached by record at %X/%X\n",
+					progname, (uint32) (endpos >> 32), (uint32) (endpos),
+					(uint32) (lsn >> 32), (uint32) lsn);
+
+	}
+}
-- 
2.5.5

0002-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From 2204f65f216c93c6eb3ca9366fd687420b8b4fcf Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 02/21] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 75 ++++++++++++++++++++++++++++-
 src/test/recovery/t/006_logical_decoding.pl | 31 +++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index c1b16ca..b2e4813 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1125,7 +1125,7 @@ sub psql
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
-			  if (blessed($exc_save) || $exc_save ne $timeout_exception);
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
 
 			$ret = undef;
 
@@ -1325,6 +1325,79 @@ sub run_log
 	TestLib::run_log(@_);
 }
 
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos if ($endpos);
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index b80a9a9..d8cc8d3 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -36,5 +40,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5

0003-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From 19f80f344660ec859b398c2affbd2d323083e46b Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 03/21] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..ab15cf3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -660,6 +661,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -675,7 +677,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -702,6 +705,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -750,6 +754,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -770,28 +897,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're a logical decoding session on C, and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * the old one.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 318726e..a8f7b76 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -234,13 +234,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -279,6 +279,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc5e508..ef8ba80 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -756,6 +757,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -984,10 +991,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..8f96728 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -160,6 +160,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 *
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index d027ea1..f0ee352 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index a847952..d2ff1e9 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before_basebackup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5
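
To make the segment arithmetic in XLogReadDetermineTimeline (patch 0003 above) concrete, here is a minimal standalone sketch. It is not the patch's code: the TimelineEntry struct, the SEG_SIZE constant (16 MB is assumed) and all function names are hypothetical stand-ins for the server's timeline history list, XLogSegSize and tliOfPointInHistory(). It finds the last byte of the segment containing a wanted page and looks up which timeline owns that byte, which is the timeline whose copy of the segment gets read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed segment size: 16 MB. */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Hypothetical, simplified timeline history entry. */
typedef struct TimelineEntry
{
	uint32_t	tli;	/* timeline ID */
	uint64_t	begin;	/* first LSN on this timeline, 0 = from the start */
	uint64_t	end;	/* switch point to the next timeline, 0 = still current */
} TimelineEntry;

/* Address of the last byte of the segment that contains lsn. */
uint64_t
end_of_segment(uint64_t lsn)
{
	return ((lsn / SEG_SIZE) + 1) * SEG_SIZE - 1;
}

/* Timeline that owns lsn, or 0 if the history does not cover it. */
uint32_t
tli_of_point(const TimelineEntry *history, size_t n, uint64_t lsn)
{
	for (size_t i = 0; i < n; i++)
	{
		if ((history[i].begin == 0 || history[i].begin <= lsn) &&
			(history[i].end == 0 || lsn < history[i].end))
			return history[i].tli;
	}
	return 0;
}

/* Timeline to read want_page from: the newest one on its segment. */
uint32_t
tli_for_page(const TimelineEntry *history, size_t n, uint64_t want_page)
{
	assert(want_page != 0);
	return tli_of_point(history, n, end_of_segment(want_page));
}
```

With a history of timeline 1 ending at 0/3000100 and timeline 2 starting there, a page at 0/3000000 lies before the switch point but is still read from timeline 2, because the whole segment was copied to the new timeline. A page at 0/2000000 sits in a segment that ends before the switch, so it is read from timeline 1.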

0004-PostgresNode-methods-to-wait-for-node-catchup.patch (text/x-patch; charset=US-ASCII)
From 987fe6ff1bb1fb3be5b7e79d63b943d2f64a0a30 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 14 Nov 2016 12:27:17 +0800
Subject: [PATCH 04/21] PostgresNode methods to wait for node catchup

---
 src/test/perl/PostgresNode.pm         | 120 +++++++++++++++++++++++++++++++++-
 src/test/recovery/t/001_stream_rep.pl |  12 +---
 2 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index c1b16ca..28e9f0b 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
+use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);
 
 our @EXPORT = qw(
@@ -1121,7 +1122,6 @@ sub psql
 		my $exc_save = $@;
 		if ($exc_save)
 		{
-
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
@@ -1173,7 +1173,7 @@ sub psql
 		  if $ret == 1;
 		die "connection error: '$$stderr'\nwhile running '@psql_params'"
 		  if $ret == 2;
-		die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+		die "error running SQL: '$$stderr'\nwhile running '@psql_params' with sql '$sql'"
 		  if $ret == 3;
 		die "psql returns $ret: '$$stderr'\nwhile running '@psql_params'";
 	}
@@ -1325,6 +1325,122 @@ sub run_log
 	TestLib::run_log(@_);
 }
 
+=pod $node->lsn
+
+Return pg_current_xlog_insert_location() or, on a replica,
+pg_last_xlog_replay_location().
+
+=cut
+
+sub lsn
+{
+	my $self = shift;
+	return $self->safe_psql('postgres', 'select case when pg_is_in_recovery() then pg_last_xlog_replay_location() else pg_current_xlog_insert_location() end as lsn;');
+}
+
+=pod $node->wait_for_catchup(standby_name, mode, target_lsn)
+
+Wait until the standby with application_name standby_name (usually from
+node->name) has caught up to or passed the upstream's xlog insert point as of
+the time this function is called. By default the replay_location is waited for,
+but 'mode' may be specified to wait for any of sent|write|flush|replay.
+
+If there is no active replication connection from this peer, waits until
+poll_query_until timeout.
+
+Requires that the 'postgres' db exists and is accessible.
+
+If target_lsn is passed, use that xlog position instead of the server's current
+xlog insert position.
+
+This is not a test. It die()s on failure.
+
+Returns the LSN caught up to.
+
+=cut
+
+sub wait_for_catchup
+{
+	my ($self, $standby_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'replay';
+	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1 );
+	die "valid modes are " . join(', ', keys(%valid_modes)) unless exists($valid_modes{$mode});
+	if ( blessed( $standby_name ) && $standby_name->isa("PostgresNode") ) {
+		$standby_name = $standby_name->name;
+	}
+	if (!defined($target_lsn)) {
+		$target_lsn = $self->lsn;
+	}
+	$self->poll_query_until('postgres', qq[SELECT '$target_lsn' <= ${mode}_location FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';])
+		or die "timed out waiting for catchup";
+	return $target_lsn;
+}
+
+=pod $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
+
+Wait for the named replication slot to equal or pass the xlog position of the
+server, or the supplied target_lsn if given. The position used is the
+restart_lsn unless mode is given, in which case it may be 'restart' or
+'confirmed_flush'.
+
+Requires that the 'postgres' db exists and is accessible.
+
+This is not a test. It die()s on failure.
+
+If the slot is not active, will time out after poll_query_until's timeout.
+
+Note that for logical slots, restart_lsn is held down by the oldest in progress tx.
+
+Returns the LSN caught up to.
+
+=cut
+
+sub wait_for_slot_catchup
+{
+	my ($self, $slot_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'restart';
+	if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+		die "valid modes are restart, confirmed_flush";
+	}
+	if (!defined($target_lsn)) {
+		$target_lsn = $self->lsn;
+	}
+	$self->poll_query_until('postgres', qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name';])
+		or die "timed out waiting for catchup";
+	return $target_lsn;
+}
+
+=pod $node->slot(slot_name)
+
+Return a hash-ref of replication slot data for the named slot, or a hash-ref
+with all values '' if not found. Does not differentiate between null and empty
+string for fields; no field is ever undef.
+
+The restart_lsn and confirmed_flush_lsn fields are returned verbatim, and also
+as a 2-list of [highword, lowword] integers. Since we rely on Perl 5.8.8 we
+can't "use bigint" (it's from 5.20), and we can't assume we have Math::Bigint
+from CPAN either.
+
+=cut
+
+sub slot
+{
+	my ($self, $slot_name) = @_;
+	my @fields = ('plugin', 'slot_type', 'datoid', 'database', 'active', 'active_pid',
+		'xmin', 'catalog_xmin', 'restart_lsn', 'confirmed_flush_lsn');
+	my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ', @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'");
+	# hash slice, see http://stackoverflow.com/a/16755894/398670 .
+	#
+	# Fills the hash with empty strings produced by x-operator element
+	# duplication if result is an empty row
+	#
+	my %val;
+	@val{@fields} = $result ne '' ? split(qr/\|/, $result) : ('',) x scalar(@fields);
+	$val{'restart_lsn_arr'} = parse_lsn($val{'restart_lsn'});
+	$val{'confirmed_flush_lsn_arr'} = parse_lsn($val{'confirmed_flush_lsn'});
+	return \%val;
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 981c00b..5ce69bb 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -40,16 +40,8 @@ $node_master->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
 
 # Wait for standbys to catch up
-my $applname_1 = $node_standby_1->name;
-my $applname_2 = $node_standby_2->name;
-my $caughtup_query =
-"SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
-$node_master->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby 1 to catch up";
-$caughtup_query =
-"SELECT pg_last_xlog_replay_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
-$node_standby_1->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby 2 to catch up";
+$node_master->wait_for_catchup($node_standby_1);
+$node_standby_1->wait_for_catchup($node_standby_2);
 
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-- 
2.5.5
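
The slot() documentation in patch 0004 above returns each LSN both verbatim and as a [highword, lowword] pair, so that positions can be compared without 64-bit integer support. A minimal sketch of that idea in C follows. The LsnWords struct and both function names are hypothetical illustrations, not the pg_lsn Perl helper the patch imports, and the only input format assumed is the usual "hex/hex" text form.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical two-word LSN, mirroring the [highword, lowword] list. */
typedef struct LsnWords
{
	uint32_t	hi;
	uint32_t	lo;
} LsnWords;

/* Parse "X/X" hex text; returns 1 on success, 0 on malformed input. */
int
parse_lsn_text(const char *text, LsnWords *out)
{
	unsigned int hi;
	unsigned int lo;
	char		trailing;

	assert(text != NULL && out != NULL);

	/* Exactly two conversions must succeed; a third means trailing junk. */
	if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
		return 0;
	out->hi = hi;
	out->lo = lo;
	return 1;
}

/* Returns <0, 0 or >0 as a is before, equal to or after b. */
int
lsn_cmp(LsnWords a, LsnWords b)
{
	if (a.hi != b.hi)
		return a.hi < b.hi ? -1 : 1;
	if (a.lo != b.lo)
		return a.lo < b.lo ? -1 : 1;
	return 0;
}
```

Comparing the high words first and the low words only on a tie gives the same ordering as comparing the full 64-bit positions, which is what the catchup checks ('target_lsn <= replay_location' and similar) need.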

0010-Send-catalog_xmin-separately-in-hot_standby_feedback.patch (text/x-patch; charset=US-ASCII)
From e1c021cb684743fcbf2033591562790cb2d1cdf4 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 16:23:57 +0800
Subject: [PATCH 10/21] Send catalog_xmin separately in hot_standby_feedback
 messages

Now that the protocol supports reporting catalog_xmin separately and
GetOldestXmin() allows us to exclude the catalog_xmin from the calculated xmin,
actually send a separate catalog_xmin to the master.

This change is necessary, but not sufficient, to allow logical decoding
on a standby.
---
 src/backend/replication/walreceiver.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 80cc482..318d8ce 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1206,11 +1206,21 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false, NULL);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL, false, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
-
-	catalog_xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
@@ -1235,7 +1245,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	pq_sendint(&reply_message, catalog_xmin, 4);
 	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
-- 
2.5.5
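
The decision made in XLogWalRcvSendHSFeedback after patch 0010 above can be summarised in a small standalone sketch. This is an illustration under assumptions, not the walreceiver code: the Feedback struct, build_feedback() and INVALID_XID are hypothetical names, and the two "oldest" values are passed in rather than obtained from GetOldestXmin(). It shows that xmin and catalog_xmin travel separately, and that the upstream is considered to hold a standby xmin if either one is valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_XID ((uint32_t) 0)	/* stand-in for InvalidTransactionId */

/* Hypothetical summary of what one feedback message would carry. */
typedef struct Feedback
{
	uint32_t	xmin;
	uint32_t	catalog_xmin;
	bool		master_has_standby_xmin;
} Feedback;

Feedback
build_feedback(bool hot_standby_feedback,
			   uint32_t oldest_data_xmin,
			   uint32_t oldest_catalog_xmin)
{
	Feedback	fb;

	if (hot_standby_feedback)
	{
		/* Report the two horizons separately. */
		fb.xmin = oldest_data_xmin;
		fb.catalog_xmin = oldest_catalog_xmin;
	}
	else
	{
		/* Feedback disabled: clear both horizons on the upstream. */
		fb.xmin = INVALID_XID;
		fb.catalog_xmin = INVALID_XID;
	}

	fb.master_has_standby_xmin =
		fb.xmin != INVALID_XID || fb.catalog_xmin != INVALID_XID;

	/* With feedback off, nothing may be held back upstream. */
	assert(hot_standby_feedback || !fb.master_has_standby_xmin);
	return fb;
}
```

Keeping catalog_xmin out of the reported xmin is what lets the upstream vacuum ordinary user tables up to the data horizon while still retaining the catalog rows that logical slots on the standby need.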

0011-Update-comment-on-issues-with-logical-decoding-on-st.patch (text/x-patch; charset=US-ASCII)
From 799bb299c705fea3fb5de990d7f8454a4054a908 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 16:33:37 +0800
Subject: [PATCH 11/21] Update comment on issues with logical decoding on
 standby

---
 src/backend/replication/logical/logical.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1512be5..85f8f0e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -88,16 +88,28 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * To allow logical decoding on a standby we must ensure that:
 	 *
-	 * There's basically three things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
+	 *	  LSN belongs to so we can follow timeline switches
+	 *
 	 * 2) We need to force hot_standby_feedback to be enabled at all times so
 	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
+	 *
+	 * 3) ensure a replication slot is used to connect to the upstream so
+	 *    we know the catalog_xmin is persistent even over connection loss.
+	 *
+	 * 4) support dropping replication slots referring to a database, in
 	 *	  dbase_redo. There can't be any active ones due to HS recovery
 	 *	  conflicts, so that should be relatively easy.
+	 *
+	 * This means we can't allow logical decoding from a standby that's only
+	 * configured for archive recovery. It would be OK to run temporarily in
+	 * archive recovery during connectivity drops so long as we have a slot
+	 * with a catalog_xmin set; it'd cause extra bloat on the master until we
+	 * can reconnect, but that's unavoidable. We don't currently have any
+	 * book-keeping about whether we have a slot unless it's in active use,
+	 * though, so we have to assume there's no slot.
 	 * ----
 	 */
 	if (RecoveryInProgress())
-- 
2.5.5

0012-Don-t-attempt-to-export-a-snapshot-from-CREATE_REPLI.patch (text/x-patch; charset=US-ASCII)
From 03f8f13ca905ac32f9870177fb05926d9bb8d3b5 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 14:05:58 +0800
Subject: [PATCH 12/21] Don't attempt to export a snapshot from
 CREATE_REPLICATION_SLOT when in recovery

Exporting a snapshot requires us to start an xact, and we can't do that from a
server in recovery. So skip snapshot export. The client must handle syncing of
initial state via some external means like a slot on the master or manually
stopping replay from a physical copy at the same LSN.
---
 src/backend/replication/walsender.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ac8c2c3..957ae36 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -842,7 +842,18 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * Export a plain (not of the snapbuild.c type) snapshot to the user
 		 * that can be imported into another session.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (!RecoveryInProgress())
+		{
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		}
+		else
+		{
+			/*
+			 * Can't assign an xid during recovery so we can't export a
+			 * snapshot.
+			 */
+			snapshot_name = "";
+		}
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
-- 
2.5.5

0013-ERROR-if-timeline-is-zero-in-walsender.patch (text/x-patch; charset=US-ASCII)
From 9542d6dd7d6481920ce39c9d7743e066face81fe Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 13:50:52 +0800
Subject: [PATCH 13/21] ERROR if timeline is zero in walsender

---
 src/backend/replication/walsender.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 957ae36..327dbb2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -520,6 +520,11 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
+	if (ThisTimeLineID == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("run IDENTIFY_SYSTEM before trying to START_REPLICATION")));
+
 	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
-- 
2.5.5

0014-Permit-logical-decoding-on-standby-with-a-warning.patch (text/x-patch; charset=US-ASCII)
From f092c53d802ce4e0d80055b645fba79fb4fb1f98 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 3 Nov 2016 15:40:22 +0800
Subject: [PATCH 14/21] Permit logical decoding on standby with a warning

---
 src/backend/replication/logical/logical.c | 36 +++++++++++++++----------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 85f8f0e..5f27452 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -88,34 +88,32 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * To allow logical decoding on a standby we must ensure that:
+	 * Logical decoding from a standby is only safe if:
 	 *
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to so we can follow timeline switches
+	 * 1) hot_standby_feedback is enabled, so catalog tuples still needed
+	 *    by the replica are not removed by the master. We already include
+	 *    slots' required xmin in the oldest global xmin reported to the master;
 	 *
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
+	 * 2) A physical replication slot is used to connect the standby
+	 *    to the master, so we can store the xmin (and catalog_xmin,
+	 *    once we send it separately) on the slot and we don't lose
+	 *    needed tuples to vacuum if we lose our connection;
 	 *
-	 * 3) ensure a replication slot is used to connect to the upstream so
-	 *    we know the catalog_xmin is persistent even over connection loss.
+	 * 3) We drop replication slots referring to a database in dbase_redo
+	 *    when the database is dropped on the master.
 	 *
-	 * 4) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 * We should really send the xmin and catalog_xmin separately in hot standby
+	 * feedback, so we don't hold down vacuum of all tables to the level we only
+	 * really need for the catalogs.
 	 *
-	 * This means we can't allow logical decoding from a standby that's only
-	 * configured for archive recovery. It would be OK to run temporarily in
-	 * archive recovery during connectivity drops so long as we have a slot
-	 * with a catalog_xmin set; it'd cause extra bloat on the master until we
-	 * can reconnect, but that's unavoidable. We don't currently have any
-	 * book-keeping about whether we have a slot unless it's in active use,
-	 * though, so we have to assume there's no slot.
+	 * In this first draft approach all three requirements are asserted by
+	 * telling the user "don't do that", so emit a warning.
 	 * ----
 	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
+		ereport(WARNING,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+				 errmsg("logical decoding during recovery is experimental")));
 }
 
 /*
-- 
2.5.5

0015-Tests-for-logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From dbe0e37d71bf32853f63ee5ca5961d0b2a7d827c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 3 Nov 2016 16:18:36 +0800
Subject: [PATCH 15/21] Tests for logical decoding on standby

---
 .../recovery/t/010_logical_decoding_on_replica.pl  | 168 +++++++++++++++++++++
 1 file changed, 168 insertions(+)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..0d869e4
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,168 @@
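+# Test logical decoding on a standby: create a logical slot on a physical
+# replica, decode changes made on the master, and check that hot standby
+# feedback maintains xmin and catalog_xmin on the master's physical slot.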
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdout, $stderr, $ret);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 4\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', "log_error_verbosity = verbose\n");
+$node_master->append_conf('postgresql.conf', "hot_standby_feedback = on\n");
+# send status rapidly so we promptly advance xmin on master
+$node_master->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->safe_psql('postgres', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('postgres'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+    my ($xmin, $catalog_xmin) = split(qr/\|/, $node_master->safe_psql('postgres', q[SELECT xmin, catalog_xmin FROM pg_replication_slots WHERE slot_name = 'decoding_standby';]));
+	return ($xmin, $catalog_xmin);
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+is($node_replica->psql('postgres', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+
+sub print_logical_xmin
+{
+    my ($xmin, $catalog_xmin) = split(qr/\|/, $node_replica->safe_psql('postgres', q[SELECT xmin, catalog_xmin FROM pg_replication_slots WHERE slot_name = 'standby_logical';]));
+	return ($xmin, $catalog_xmin);
+}
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('postgres', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('postgres', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('postgres', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, 'psql:<stdin>:1: WARNING:  logical decoding during recovery is experimental', 'stderr is warning');
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+for my $i (0 .. 1000)
+{
+    $node_master->safe_psql('postgres', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('postgres', 'VACUUM');
+
+($ret, $stdout, $stderr) = $node_replica->psql('postgres', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical catalog_xmin not null");
+isnt($new_logical_catalog_xmin, $logical_catalog_xmin, "logical catalog_xmin changed");
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys xmin now the standby's slot doesn't
+# hold it down as far.
+isnt($new_physical_xmin, $physical_xmin, "physical xmin changed");
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_replica->psql('postgres', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+is($catalog_xmin, '', "physical catalog_xmin null");
-- 
2.5.5

0016-Drop-logical-replication-slots-when-redoing-database.patch (text/x-patch; charset=US-ASCII)
From 344c629ff43d9a44cff55eea5968e693ba1ee4f4 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 17 Nov 2016 15:25:29 +0800
Subject: [PATCH 16/21] Drop logical replication slots when redoing database
 drop

When a standby has logical replication slots on its database, drop them
as part of redoing database drop.
---
 src/backend/commands/dbcommands.c                  |   6 ++
 src/backend/replication/slot.c                     |  72 +++++++++++++
 src/include/replication/slot.h                     |   1 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 118 ++++++++++++++++++---
 4 files changed, 182 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 0919ad8..3efc833 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2119,11 +2119,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 0b2575e..426f0d0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -758,6 +758,78 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+
+	if (max_replication_slots <= 0)
+		return;
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Deactivate the slot in memory */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			s->active_pid = 0;
+			s->in_use = false;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots.
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/* and purge it from disk */
+		sprintf(path, "pg_replslot/%s", NameStr(slotname));
+
+		/* if deletion fails we want to bail out and force retry of recovery */
+		if (!rmtree(path, true))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove directory \"%s\" for slot \"%s\"",
+					 		path, NameStr(slotname))));
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index e00562d..4ad2bcf 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -175,6 +175,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
index 0d869e4..7934e9f 100644
--- a/src/test/recovery/t/010_logical_decoding_on_replica.pl
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 28;
+use Test::More tests => 43;
 use RecursiveCopy;
 use File::Copy;
 
@@ -28,10 +28,12 @@ $node_master->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n
 $node_master->dump_info;
 $node_master->start;
 
-$node_master->safe_psql('postgres', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
 $backup_name = 'b1';
 my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
-TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('postgres'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
 
 open(my $fh, "<", $backup_dir . "/recovery.conf")
   or die "can't open recovery.conf";
@@ -50,8 +52,8 @@ ok($found, "using physical slot for standby");
 
 sub print_phys_xmin
 {
-    my ($xmin, $catalog_xmin) = split(qr/\|/, $node_master->safe_psql('postgres', q[SELECT xmin, catalog_xmin FROM pg_replication_slots WHERE slot_name = 'decoding_standby';]));
-	return ($xmin, $catalog_xmin);
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
 }
 
 my ($xmin, $catalog_xmin) = print_phys_xmin();
@@ -77,13 +79,13 @@ ok($xmin, "xmin not null");
 ok(!$catalog_xmin, "catalog_xmin null");
 
 # Create new slots on the replica, ignoring the ones on the master completely.
-is($node_replica->psql('postgres', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
    0, 'logical slot creation on standby succeeded');
 
 sub print_logical_xmin
 {
-    my ($xmin, $catalog_xmin) = split(qr/\|/, $node_replica->safe_psql('postgres', q[SELECT xmin, catalog_xmin FROM pg_replication_slots WHERE slot_name = 'standby_logical';]));
-	return ($xmin, $catalog_xmin);
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
 }
 
 $node_master->wait_for_catchup($node_replica);
@@ -97,8 +99,8 @@ isnt($catalog_xmin, '', "physical catalog_xmin not null");
 is($xmin, '', "logical xmin null");
 isnt($catalog_xmin, '', "logical catalog_xmin not null");
 
-$node_master->safe_psql('postgres', 'CREATE TABLE test_table(id serial primary key, blah text)');
-$node_master->safe_psql('postgres', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
 
 $node_master->wait_for_catchup($node_replica);
 sleep(2); # ensure walreceiver feedback sent
@@ -110,7 +112,7 @@ isnt($catalog_xmin, '', "physical catalog_xmin not null");
 $node_master->wait_for_catchup($node_replica);
 sleep(2); # ensure walreceiver feedback sent
 
-($ret, $stdout, $stderr) = $node_replica->psql('postgres', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
 is($ret, 0, 'replay from slot succeeded');
 is($stdout, q{BEGIN
 table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
@@ -133,11 +135,11 @@ isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
 # we hold down xmin.
 for my $i (0 .. 1000)
 {
-    $node_master->safe_psql('postgres', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
 }
-$node_master->safe_psql('postgres', 'VACUUM');
+$node_master->safe_psql('testdb', 'VACUUM');
 
-($ret, $stdout, $stderr) = $node_replica->psql('postgres', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
 is($ret, 0, 'replay of big series succeeded');
 
 $node_master->wait_for_catchup($node_replica);
@@ -158,7 +160,9 @@ isnt($new_physical_xmin, '', "physical xmin not null");
 isnt($new_physical_xmin, $physical_xmin, "physical xmin changed");
 isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
 
-$node_replica->psql('postgres', q[SELECT pg_drop_replication_slot('standby_logical')]);
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
 
 $node_master->wait_for_catchup($node_replica);
 sleep(2); # ensure walreceiver feedback sent
@@ -166,3 +170,87 @@ sleep(2); # ensure walreceiver feedback sent
 ($xmin, $catalog_xmin) = print_phys_xmin();
 isnt($xmin, '', "physical xmin not null");
 is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+diag "Testing dropdb when downstream slot is not in-use";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot']);
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot']);
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica);
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+my $handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+$handle->reap_nb;
+$handle->pump;
+
+if (!$handle->pumpable)
+{
+	$handle->finish;
+	BAIL_OUT("pg_recvlogical already exited with " . (($handle->results())[0]) . " and stderr '$stderr'");
+}
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+# replication won't catch up, we'll error on apply while the slot is in use
+# TODO check for error
+
+$node_master->wait_for_catchup($node_replica);
+
+sleep(1);
+
+# our client should've terminated
+do {
+	local $@;
+	eval {
+		$handle->finish;
+	};
+	my $return = $?;
+	my $save_exc = $@;
+	if ($@) {
+		diag "pg_recvlogical terminated with $? and stderr '$stderr'";
+		is($return, 1, "pg_recvlogical terminated by server");
+	}
+	else
+	{
+		fail("pg_recvlogical not terminated? $save_exc");
+	}
+};
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5
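
To make the slot filtering in ReplicationSlotsDropDBSlots() easier to follow, here is a small standalone C model of the loop. It is a sketch under simplifying assumptions: ModelSlot and model_drop_db_slots() are illustrative names, there is no locking or on-disk cleanup, and an active matching slot is reported with a -1 return where the patch would PANIC. A slot is treated as logical when its database OID is valid, which is how SlotIsLogical() distinguishes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Simplified stand-in for ReplicationSlot; field names are illustrative. */
typedef struct ModelSlot
{
	bool		in_use;
	Oid			database;		/* InvalidOid for physical slots */
	int			active_pid;		/* 0 when no backend holds the slot */
} ModelSlot;

/*
 * Free every idle logical slot that belongs to dboid.
 *
 * Returns the number of slots dropped, or -1 as soon as a matching slot is
 * found to be active (the caller's database lock should make that impossible).
 */
static int
model_drop_db_slots(ModelSlot *slots, size_t nslots, Oid dboid)
{
	int			dropped = 0;

	assert(slots != NULL || nslots == 0);

	for (size_t i = 0; i < nslots; i++)
	{
		ModelSlot  *s = &slots[i];

		if (!s->in_use)
			continue;			/* free array entry */
		if (s->database == InvalidOid)
			continue;			/* physical slot, not tied to a database */
		if (s->database != dboid)
			continue;			/* logical slot on some other database */
		if (s->active_pid != 0)
			return -1;			/* still held by a backend */

		s->in_use = false;		/* release the in-memory entry */
		dropped++;
	}

	return dropped;
}
```

Physical slots and logical slots on other databases are left alone, which is what the 'otherslot' check in the test is meant to confirm.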

0017-Allow-walsender-to-exit-on-conflict-with-recovery.patch (text/x-patch; charset=US-ASCII)
From 8585ca764dc305164dbe6368d7ed505539f01a49 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Fri, 18 Nov 2016 10:24:55 +0800
Subject: [PATCH 17/21] Allow walsender to exit on conflict with recovery

Now that logical decoding on standby is supported, the walsender needs to be
able to exit in response to a conflict with recovery, so that replay of a
DROP DATABASE can proceed.

Does not deal with recovery conflicts due to vacuum activity.

WIP:

* The comments on RecoveryConflictInterrupt() still say it's only called
  by normal user backends.

* There's no safeguard to stop walsender from invoking other recovery conflict
  clauses that may be unsafe for it to call.

* We'll try to clobber walsender sessions that conflict with recovery based
  on vacuum activity to non-catalog, non-user-catalog relations where it's safe
  to continue decoding. We need to treat decoding backends differently and only
  clobber them when we have to invalidate based on satisfying catalog
  requirements.

* A logical decoding session in the walsender often won't have a vtxid or will
  change xids too fast for ResolveRecoveryConflictWithVirtualXIDs to do its
  job. We need to detect when catalog_xmin can't be satisfied when starting to
  process a new xact in walsender decoding.
---
 src/backend/replication/walsender.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 327dbb2..65b38a2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -187,7 +187,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -2666,17 +2665,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2710,7 +2698,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
-- 
2.5.5

0018-Tests-for-db-drop-during-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 2a835ee02774d1daf12ce01895e0621c1d880496 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Fri, 18 Nov 2016 11:01:41 +0800
Subject: [PATCH 18/21] Tests for db drop during decoding on standby

---
 .../recovery/t/010_logical_decoding_on_replica.pl  | 46 +++++++++++-----------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
index 7934e9f..8d5f4f0 100644
--- a/src/test/recovery/t/010_logical_decoding_on_replica.pl
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 43;
+use Test::More tests => 44;
 use RecursiveCopy;
 use File::Copy;
 
@@ -25,6 +25,8 @@ $node_master->append_conf('postgresql.conf', "log_error_verbosity = verbose\n");
 $node_master->append_conf('postgresql.conf', "hot_standby_feedback = on\n");
 # send status rapidly so we promptly advance xmin on master
 $node_master->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+# very promptly terminate conflicting backends
+$node_master->append_conf('postgresql.conf', "max_standby_streaming_delay = '2s'\n");
 $node_master->dump_info;
 $node_master->start;
 
@@ -205,7 +207,7 @@ is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslo
 
 # make sure the slot is in use
 diag "starting pg_recvlogical";
-my $handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--start'], '>', \$stdout, '2>', \$stderr);
+my $handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
 sleep(1);
 $handle->reap_nb;
 $handle->pump;
@@ -225,31 +227,31 @@ diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'ac
 $node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
 ok(1, 'dropdb finished');
 
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+my $return = $?;
+if ($return) {
+	diag "pg_recvlogical terminated with $return and stderr '$stderr'";
+	is($return, 256, "pg_recvlogical terminated by server");
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
 # replication won't catch up, we'll error on apply while the slot is in use
 # TODO check for error
 
 $node_master->wait_for_catchup($node_replica);
 
-sleep(1);
-
-# our client should've terminated
-do {
-	local $@;
-	eval {
-		$handle->finish;
-	};
-	my $return = $?;
-	my $save_exc = $@;
-	if ($@) {
-		diag "pg_recvlogical terminated with $? and stderr '$stderr'";	
-		is($return, 1, "pg_recvlogical terminated by server");
-	}
-	else
-	{
-		fail("pg_recvlogical not terminated? $save_exc");
-	}
-};
-
 is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
   'database dropped on standby');
 
-- 
2.5.5

#4Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#3)
Re: Logical decoding on standby

On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:

I'm currently looking at making detection of replay conflict with a
slot work by separating the current catalog_xmin into two effective
parts - the catalog_xmin currently needed by any known slots
(ProcArray->replication_slot_catalog_xmin, as now), and the oldest
actually valid catalog_xmin where we know we haven't removed anything
yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs. We don't
emit a WAL record when we advance oldestXid in
SetTransactionIdLimit(), and doing so is useless because vacuum will
have already removed needed tuples from needed catalogs before calling
SetTransactionIdLimit() from vac_truncate_clog(). We know that if
oldestXid is n, the true valid catalog_xmin where no needed tuples
have been removed must be >= n. But we need to know the lower bound of
valid catalog_xmin, which oldestXid doesn't give us.

So right now a standby has no way to reliably know if the catalog_xmin
requirement for a given replication slot can be satisfied. A standby
can't tell based on a xl_heap_cleanup_info record, xl_heap_clean
record, etc whether the affected table is a catalog or not, and
shouldn't generate conflicts for non-catalogs since otherwise it'll be
constantly clobbering walsenders.

A 2-phase advance of the global catalog_xmin would mean that
GetOldestXmin() would return a value from ShmemVariableCache, not the
oldest catalog xmin from ProcArray like it does now. (auto)vacuum
would then be responsible for:

* Reading the oldest catalog_xmin from procarray
* If it has advanced vs what's present in ShmemVariableCache, writing
a new xlog record type recording an advance of oldest catalog xmin
* advancing ShmemVariableCache's oldest catalog xmin

and would do so before it called GetOldestXmin via
vacuum_set_xid_limits() to determine what it can remove.

GetOldestXmin would return the ProcArray's copy of the oldest
catalog_xmin when in recovery, so we report it via hot_standby_feedback
to the upstream, it's recorded on our physical slot, and in turn
causes vacuum to advance the master's effective oldest catalog_xmin
for vacuum.

On the standby we'd replay the catalog_xmin advance record, advance
the standby's ShmemVariableCache's oldest catalog xmin, and check to
see if any replication slots, active or not, have a catalog_xmin older
than the new threshold. If none do, there's no conflict and we're
fine. If any do, we wait
max_standby_streaming_delay/max_standby_archive_delay as appropriate,
then generate recovery conflicts against all backends that have an
active replication slot based on the replication slot state in shmem.
Those backends - walsender or normal decoding backend - would promptly
die. New decoding sessions will check the ShmemVariableCache and
refuse to start if their catalog_xmin is < the threshold. Since we
advance it before generating recovery conflicts there's no race with
clients trying to reconnect after their backend is killed with a
conflict.
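The standby-side check described above can be sketched roughly as follows. This is only an illustration under assumptions, not code from the patch: SketchSlot, find_conflicting_slots() and decoding_may_start() are hypothetical names, and xid_precedes() is a simplified modulo-2^32 comparison in the spirit of TransactionIdPrecedes() that ignores the special permanent XIDs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical, cut-down stand-in for a replication slot's shared state. */
typedef struct SketchSlot
{
    bool          in_use;
    int           active_pid;    /* 0 when no backend has acquired the slot */
    TransactionId catalog_xmin;  /* 0 (invalid) for slots with no catalog_xmin */
} SketchSlot;

/* Wraparound-aware "a is older than b"; simplified, normal XIDs only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Called after replaying an (assumed) catalog_xmin advance record and after
 * the new threshold has already been published. Returns the number of slots
 * whose catalog_xmin is older than the threshold, and collects the pids of
 * the backends that currently hold such slots so they can be cancelled.
 */
static size_t
find_conflicting_slots(const SketchSlot *slots, size_t nslots,
                       TransactionId new_threshold,
                       int *pids_to_cancel, size_t *npids)
{
    size_t nconflicts = 0;

    *npids = 0;
    for (size_t i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use || slots[i].catalog_xmin == 0)
            continue;           /* unused slot, or no catalog requirement */
        if (xid_precedes(slots[i].catalog_xmin, new_threshold))
        {
            nconflicts++;
            if (slots[i].active_pid != 0)
                pids_to_cancel[(*npids)++] = slots[i].active_pid;
        }
    }
    return nconflicts;
}

/* A new decoding session refuses to start if its slot is already too old. */
static bool
decoding_may_start(TransactionId slot_catalog_xmin, TransactionId threshold)
{
    return !xid_precedes(slot_catalog_xmin, threshold);
}
```

Because the threshold is advanced before any backend is cancelled, a client that reconnects right after being killed hits decoding_may_start() with the new value and is refused, which is the "no race" property claimed above.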

If we wanted to get fancy we could set the latches of walsender
backends at risk of conflicting and they could check
ShmemVariableCache's oldest valid catalog xmin, so they could send
immediate keepalives with reply_requested set and hopefully get flush
confirmation from the client and advance their catalog_xmin before we
terminate them as conflicting with recovery. But that can IMO be done
later/separately.

Going to prototype this.

Alternate approach:
---------------

The oldest xid in heap_xlog_cleanup_info is relation-specific, but the
standby has no way to know if it's a catalog relation or not during
redo and know whether to kill slots and decoding sessions based on its
latestRemovedXid. Same for xl_heap_clean and the other records that
can cause snapshot conflicts (xl_heap_visible, xl_heap_freeze_page,
xl_btree_delete, xl_btree_reuse_page, spgxlogVacuumRedirect).

Instead of adding a 2-phase advance of the global catalog_xmin, we can
instead add a rider to each of these records that identifies whether
it's a catalog table or not. This would only be emitted when wal_level
>= logical, but it *would* increase WAL sizes a bit when logical
decoding is enabled even if it's not going to be used on a standby.
The rider would be a simple:

typedef struct xl_rel_catalog_info
{
    bool rel_accessible_from_logical_decoding;
} xl_rel_catalog_info;

or similar. During redo we call a new
ResolveRecoveryConflictWithLogicalSlot function from each of those
records' redo routines that does what I outlined above.

This way we add more info to more xlog records, and the upstream has to
use RelationIsAccessibleInLogicalDecoding() to set up the records when
writing the xlogs. In exchange, we don't have to add a new field to
CheckPoint or ShmemVariableCache or add a new xlog record type. It
seems the worse option to me.
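For comparison, a rough sketch of how a redo routine might use such a rider. Everything here is an assumption made for illustration: sketch_heap_clean is a hypothetical cut-down record carrying only latestRemovedXid plus the rider, and record_conflicts_with_slot() is an invented helper, not the ResolveRecoveryConflictWithLogicalSlot function mentioned above. The conflict test (slot catalog_xmin at or before latestRemovedXid) mirrors how snapshot conflicts are normally judged against latestRemovedXid.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* The rider from the email, with one consistent type name. */
typedef struct xl_rel_catalog_info
{
    bool rel_accessible_from_logical_decoding;
} xl_rel_catalog_info;

/* Hypothetical cut-down cleanup record: only what the sketch needs. */
typedef struct sketch_heap_clean
{
    TransactionId       latestRemovedXid;
    xl_rel_catalog_info catalog_info;
} sketch_heap_clean;

/*
 * Returns true if replaying rec would remove catalog rows that a slot with
 * the given catalog_xmin may still need. Records for ordinary tables never
 * conflict, so walsenders are not clobbered by routine vacuum activity.
 * The comparison is wraparound-aware but ignores special XIDs.
 */
static bool
record_conflicts_with_slot(const sketch_heap_clean *rec,
                           TransactionId slot_catalog_xmin)
{
    if (!rec->catalog_info.rel_accessible_from_logical_decoding)
        return false;
    /* conflict when slot_catalog_xmin <= latestRemovedXid */
    return (int32_t) (slot_catalog_xmin - rec->latestRemovedXid) <= 0;
}
```

The trade-off is visible in the struct: every conflict-generating record grows by the rider, whereas the 2-phase approach needs one extra record type and one extra shared field instead.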

(BTW, as comments on GetOldestSafeDecodingTransactionId() note, we
can't rely on KnownAssignedXidsGetOldestXmin() since it can be
incomplete at least on standby.)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#4)
Re: Logical decoding on standby

On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:

I'm currently looking at making detection of replay conflict with a
slot work by separating the current catalog_xmin into two effective
parts - the catalog_xmin currently needed by any known slots
(ProcArray->replication_slot_catalog_xmin, as now), and the oldest
actually valid catalog_xmin where we know we haven't removed anything
yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs.

Really?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#5)
Re: Logical decoding on standby

On 23 November 2016 at 03:55, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:

I'm currently looking at making detection of replay conflict with a
slot work by separating the current catalog_xmin into two effective
parts - the catalog_xmin currently needed by any known slots
(ProcArray->replication_slot_catalog_xmin, as now), and the oldest
actually valid catalog_xmin where we know we haven't removed anything
yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs.

Really?

(Note the double negative above).

Yes, necessarily so. You can't look up xids older than the clog
truncation threshold at oldestXid, per our discussion on txid_status()
and traceable commit. But the tuples from that xact aren't guaranteed
to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
which calls GetOldestXmin(...); that in turn scans ProcArray to find
the oldest xid any running xact cares about. It might bump it down
further if there's a replication slot requirement or based on
vacuum_defer_cleanup_age, but it doesn't care in the slightest about
oldestXmin.

oldestXmin cannot advance until vacuum has removed all tuples for that
xid and advanced the database's datfrozenxid. But a given oldestXmin
says nothing about which tuples, catalog or otherwise, still exist and
are accessible.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#6)
Re: Logical decoding on standby

On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs.

Really?

(Note the double negative above).

Yes, necessarily so. You can't look up xids older than the clog
truncation threshold at oldestXid, per our discussion on txid_status()
and traceable commit. But the tuples from that xact aren't guaranteed
to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
which calls GetOldestXmin(...); that in turn scans ProcArray to find
the oldest xid any running xact cares about. It might bump it down
further if there's a replication slot requirement or based on
vacuum_defer_cleanup_age, but it doesn't care in the slightest about
oldestXmin.

oldestXmin cannot advance until vacuum has removed all tuples for that
xid and advanced the database's datfrozenxid. But a given oldestXmin
says nothing about which tuples, catalog or otherwise, still exist and
are accessible.

Right. Sorry, my mistake.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#7)
Re: Logical decoding on standby

On 26 Nov. 2016 23:40, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs.

Really?

(Note the double negative above).

Yes, necessarily so. You can't look up xids older than the clog
truncation threshold at oldestXid, per our discussion on txid_status()
and traceable commit. But the tuples from that xact aren't guaranteed
to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
which calls GetOldestXmin(...); that in turn scans ProcArray to find
the oldest xid any running xact cares about. It might bump it down
further if there's a replication slot requirement or based on
vacuum_defer_cleanup_age, but it doesn't care in the slightest about
oldestXmin.

oldestXmin cannot advance until vacuum has removed all tuples for that
xid and advanced the database's datfrozenxid. But a given oldestXmin
says nothing about which tuples, catalog or otherwise, still exist and
are accessible.

Right. Sorry, my mistake.

Phew. Had me worried there.

Thanks for looking over it. Prototype looks promising so far.

#9Petr Jelinek
petr@2ndquadrant.com
In reply to: Craig Ringer (#3)
Re: Logical decoding on standby

Hi,

I did look at the code a bit. The first 6 patches seem reasonable.
I don't understand why some patches are separate tbh (like 7-10, or 11).

About the 0009:

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9985e3e..4fa3ad4 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
if (all_visible)
{
/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
}
rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
* a buffer lock. And this shouldn't happen often, so it's
* worth being careful so as to avoid false positives.
*/
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index f524fc4..5b33c97 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
TransactionId OldestXmin;
uint64		misc_count = 0;
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
bstrategy = GetAccessStrategy(BAS_BULKREAD);

nblocks = RelationGetNumberOfBlocks(rel);

This does not seem correct, you are sending false as pointer parameter.

0012:

I think there should be parameter saying if snapshot should be exported
or not and if user asks for it on standby it should fail.

0014 makes 0011 even more pointless.

Not going into deeper detail as this is still very WIP. I do agree with
the general design though.

This also replaces the previous timeline following and decoding
threads/CF entries so maybe those should be closed in CF?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Craig Ringer
craig@2ndquadrant.com
In reply to: Petr Jelinek (#9)
Re: Logical decoding on standby
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
TransactionId OldestXmin;
uint64          misc_count = 0;
-     OldestXmin = GetOldestXmin(rel, true);
+     OldestXmin = GetOldestXmin(rel, true, false);
bstrategy = GetAccessStrategy(BAS_BULKREAD);

nblocks = RelationGetNumberOfBlocks(rel);

This does not seem correct, you are sending false as pointer parameter.

Thanks. That's an oversight from the GetOldestXmin interface change
per your prior feedback. C doesn't care since null is 0 and false is
0, and I missed it when transforming the patch.
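A tiny illustration of why the compiler stayed quiet, assuming a C99-style <stdbool.h> where false expands to the integer constant 0 and therefore counts as a null pointer constant. The function below is hypothetical and only has a pointer as its third parameter, like the changed interface; it is not GetOldestXmin().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical function whose third parameter is a pointer. */
static int
third_arg_is_null(int a, bool b, const int *c)
{
    (void) a;
    (void) b;
    return c == NULL;
}

/* Passing false where a pointer is expected yields a null pointer. */
static int
call_with_false(void)
{
    return third_arg_is_null(0, true, false);
}
```

Writing NULL at the call site produces the same object code but states the intent, which is why the review comment is still worth acting on.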

0012:

I think there should be parameter saying if snapshot should be exported
or not and if user asks for it on standby it should fail.

Sounds reasonable. That also means clients can suppress snapshot export
on master, which as we recently learned can be desirable sometimes.

0014 makes 0011 even more pointless.

Yeah, as I said, it's a bit WIP still and needs some rebasing and rearrangement.

This also replaces the previous timeline following and decoding
threads/CF entries so maybe those should be closed in CF?

I wasn't sure what to do about that, since it's all a set of related
functionality. I think it's going to get more traction as a single
"logical decoding on standby" feature though, since the other parts are
hard to test and use in isolation. So yeah, probably, I'll do so
unless someone objects.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#1)
9 attachment(s)
Re: Logical decoding on standby

On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:

Hi all

I've prepared a working initial, somewhat raw implementation for
logical decoding on physical standbys.

Hi all

I've attached a significantly revised patch, which now incorporates
safeguards to ensure that we prevent decoding if the master has not
retained needed catalogs and cancel decoding sessions that are holding
up apply because they need too-old catalogs.

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field. Vacuum notices if
procArray->replication_slot_catalog_xmin has advanced past
ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
record with the new value before it copies it to oldestCatalogXmin.
This means that a standby can now reliably tell when catalogs are
about to be removed or become candidates for removal, so it can pause
redo until logical decoding sessions on the standby advance far enough
that their catalog_xmin passes that point. It also means that if our
hot_standby_feedback somehow fails to lock in the catalogs our slots
need on a standby, we can cancel sessions with a conflict with
recovery.
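The ordering on the master (log the new value first, then let vacuum use it) can be sketched like this. It is a simplified model under assumptions, not the patch itself: SketchState and its fields are hypothetical stand-ins for procArray->replication_slot_catalog_xmin, ShmemVariableCache->oldestCatalogXmin and the WAL stream, and the XID comparison ignores special XIDs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-ins for the shared state discussed above. */
typedef struct SketchState
{
    TransactionId slot_catalog_xmin;    /* what the slots currently need */
    TransactionId oldest_catalog_xmin;  /* what vacuum is allowed to use */
    TransactionId last_logged;          /* value in the last "WAL record" */
    int           records_written;      /* how many such records exist */
} SketchState;

/* Wraparound-aware "a is older than b"; simplified, normal XIDs only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Run by vacuum before it computes its removal limit. If the slots'
 * requirement has moved forward, record the new value first and only then
 * copy it to the field vacuum reads, so a standby always learns about an
 * advance before it can see any catalog row removed under it.
 */
static TransactionId
advance_oldest_catalog_xmin(SketchState *st)
{
    if (xid_precedes(st->oldest_catalog_xmin, st->slot_catalog_xmin))
    {
        st->last_logged = st->slot_catalog_xmin;    /* "write WAL" first */
        st->records_written++;
        st->oldest_catalog_xmin = st->slot_catalog_xmin;
    }
    return st->oldest_catalog_xmin;
}
```

When both values are 0, as with wal_level below logical, the function writes nothing and returns 0, matching the "won't do anything" case in the next paragraph.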

If wal_level is < logical this won't do anything, since
replication_slot_catalog_xmin and oldestCatalogXmin will both always
be 0.

Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
new replication_slot_catalog_xmin, this won't impact catalog bloat.

Ideally this mechanism won't generally actually be needed, since
hot_standby_feedback stops the master from removing needed catalogs,
and we make an effort to ensure that the standby has
hot_standby_feedback enabled and is using a replication slot. We
cannot prevent the user from dropping and re-creating the physical
slot on the upstream, though, and it doesn't look simple to stop them
turning off hot_standby_feedback or turning off use of a physical slot
after creating logical slots, either. So we try to stop users shooting
themselves in the foot, but if they do it anyway we notice and cope
gracefully. Logging catalog_xmin also helps slots created on standbys
know where to start, and makes sure we can deal gracefully with a race
between hs_feedback and slot creation on a standby.

There can be a significant delay for slot creation on standby since we
have to wait until there's a new xl_running_xacts record logged. I'd
like to extend the hot_standby_feedback protocol a little to address
that and some other issues, but that's a separate step.

I haven't addressed Petr's point yet, that "there should be parameter
saying if snapshot should be exported or not and if user asks for it on
standby it should fail". Otherwise I think it's looking fairly solid.

Due to the amount of churn I landed up flattening the patchset. It
probably makes sense to split it up, likely into the sequence of
changes listed in the commit message. I'll wait for a general opinion
on the validity of this approach first.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-PostgresNode-methods-to-wait-for-node-catchup.patch (text/x-patch; charset=US-ASCII)
From 60ba9a48992fe16afef4d481d45def5620f002ed Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 14 Nov 2016 12:27:17 +0800
Subject: [PATCH 1/9] PostgresNode methods to wait for node catchup

---
 src/test/perl/PostgresNode.pm         | 120 +++++++++++++++++++++++++++++++++-
 src/test/recovery/t/001_stream_rep.pl |  12 +---
 2 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index c1b16ca..28e9f0b 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
+use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);
 
 our @EXPORT = qw(
@@ -1121,7 +1122,6 @@ sub psql
 		my $exc_save = $@;
 		if ($exc_save)
 		{
-
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
@@ -1173,7 +1173,7 @@ sub psql
 		  if $ret == 1;
 		die "connection error: '$$stderr'\nwhile running '@psql_params'"
 		  if $ret == 2;
-		die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+		die "error running SQL: '$$stderr'\nwhile running '@psql_params' with sql '$sql'"
 		  if $ret == 3;
 		die "psql returns $ret: '$$stderr'\nwhile running '@psql_params'";
 	}
@@ -1325,6 +1325,122 @@ sub run_log
 	TestLib::run_log(@_);
 }
 
+=pod $node->lsn
+
+Return pg_current_xlog_insert_location() or, on a replica,
+pg_last_xlog_replay_location().
+
+=cut
+
+sub lsn
+{
+	my $self = shift;
+	return $self->safe_psql('postgres', 'select case when pg_is_in_recovery() then pg_last_xlog_replay_location() else pg_current_xlog_insert_location() end as lsn;');
+}
+
+=pod $node->wait_for_catchup(standby_name, mode, target_lsn)
+
+Wait for the node with application_name standby_name (usually from node->name)
+until its replication equals or passes the upstream's xlog insert point at the
+time this function is called. By default the replay_location is waited for,
+but 'mode' may be specified to wait for any of sent|write|flush|replay.
+
+If there is no active replication connection from this peer, waits until
+poll_query_until timeout.
+
+Requires that the 'postgres' db exists and is accessible.
+
+If target_lsn is passed, use that xlog position instead of the server's current
+xlog insert position.
+
+This is not a test. It die()s on failure.
+
+Returns the LSN caught up to.
+
+=cut
+
+sub wait_for_catchup
+{
+	my ($self, $standby_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'replay';
+	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1 );
+	die "valid modes are " . join(', ', keys(%valid_modes)) unless exists($valid_modes{$mode});
+	if ( blessed( $standby_name ) && $standby_name->isa("PostgresNode") ) {
+		$standby_name = $standby_name->name;
+	}
+	if (!defined($target_lsn)) {
+		$target_lsn = $self->lsn;
+	}
+	$self->poll_query_until('postgres', qq[SELECT '$target_lsn' <= ${mode}_location FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';])
+		or die "timed out waiting for catchup";
+	return $target_lsn;
+}
+
+=pod $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
+
+Wait for the named replication slot to equal or pass the xlog position of the
+server, or the supplied target_lsn if given. The position used is the
+restart_lsn unless mode is given, in which case it may be 'restart' or
+'confirmed_flush'.
+
+Requires that the 'postgres' db exists and is accessible.
+
+This is not a test. It die()s on failure.
+
+If the slot is not active, will time out after poll_query_until's timeout.
+
+Note that for logical slots, restart_lsn is held down by the oldest in progress tx.
+
+Returns the LSN caught up to.
+
+=cut
+
+sub wait_for_slot_catchup
+{
+	my ($self, $slot_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'restart';
+	if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+		die "valid modes are restart, confirmed_flush";
+	}
+	if (!defined($target_lsn)) {
+		$target_lsn = $self->lsn;
+	}
+	$self->poll_query_until('postgres', qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name';])
+		or die "timed out waiting for catchup";
+	return $target_lsn;
+}
+
+=pod $node->slot(slot_name)
+
+Return hash-ref of replication slot data for the named slot, or a hash-ref with
+all values '' if not found. Does not differentiate between null and empty string
+for fields, no field is ever undef.
+
+The restart_lsn and confirmed_flush_lsn fields are returned verbatim, and also
+as a 2-list of [highword, lowword] integer. Since we rely on Perl 5.8.8 we can't
+"use bigint", it's from 5.20, and we can't assume we have Math::Bigint from CPAN
+either.
+
+=cut
+
+sub slot
+{
+	my ($self, $slot_name) = @_;
+	my @fields = ('plugin', 'slot_type', 'datoid', 'database', 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn', 'confirmed_flush_lsn');
+	my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ', @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'");
+	$result = undef if $result eq '';
+	# hash slice, see http://stackoverflow.com/a/16755894/398670 .
+	#
+	# Fills the hash with empty strings produced by x-operator element
+	# duplication if result is an empty row
+	#
+	my %val;
+	@val{@fields} = defined($result) ? split(qr/\|/, $result) : ('',) x scalar(@fields);
+	$val{'restart_lsn_arr'} = parse_lsn($val{'restart_lsn'});
+	$val{'confirmed_flush_lsn_arr'} = parse_lsn($val{'confirmed_flush_lsn'});
+	return \%val;
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 981c00b..5ce69bb 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -40,16 +40,8 @@ $node_master->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
 
 # Wait for standbys to catch up
-my $applname_1 = $node_standby_1->name;
-my $applname_2 = $node_standby_2->name;
-my $caughtup_query =
-"SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
-$node_master->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby 1 to catch up";
-$caughtup_query =
-"SELECT pg_last_xlog_replay_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
-$node_standby_1->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby 2 to catch up";
+$node_master->wait_for_catchup($node_standby_1);
+$node_standby_1->wait_for_catchup($node_standby_2);
 
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-- 
2.5.5

0002-Create-new-pg_lsn-class-to-deal-with-awkward-LSNs-in.patch (text/x-patch; charset=US-ASCII)
From 9de720c406300eb0c5a31e6829963b70fa14508d Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 14 Nov 2016 12:19:35 +0800
Subject: [PATCH 2/9] Create new pg_lsn class to deal with awkward LSNs in
 tests

---
 src/test/perl/Makefile        |   3 +
 src/test/perl/pg_lsn.pm       | 144 ++++++++++++++++++++++++++++++++++++++++++
 src/test/perl/t/001_load.pl   |   9 +++
 src/test/perl/t/002_pg_lsn.pl |  68 ++++++++++++++++++++
 4 files changed, 224 insertions(+)
 create mode 100644 src/test/perl/pg_lsn.pm
 create mode 100644 src/test/perl/t/001_load.pl
 create mode 100644 src/test/perl/t/002_pg_lsn.pl

diff --git a/src/test/perl/Makefile b/src/test/perl/Makefile
index 8ab60fc..cdc38f4 100644
--- a/src/test/perl/Makefile
+++ b/src/test/perl/Makefile
@@ -15,6 +15,9 @@ include $(top_builddir)/src/Makefile.global
 
 ifeq ($(enable_tap_tests),yes)
 
+check:
+	$(prove_check)
+
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(pgxsdir)/$(subdir)'
 
diff --git a/src/test/perl/pg_lsn.pm b/src/test/perl/pg_lsn.pm
new file mode 100644
index 0000000..777b3df
--- /dev/null
+++ b/src/test/perl/pg_lsn.pm
@@ -0,0 +1,144 @@
+package pg_lsn;
+
+use strict;
+use warnings;
+
+our (@ISA, @EXPORT_OK);
+BEGIN {
+	require Exporter;
+	@ISA = qw(Exporter);
+	@EXPORT_OK = qw(parse_lsn);
+}
+
+use Scalar::Util qw(blessed looks_like_number);
+use Carp;
+
+use overload
+	'""' => \&Str,
+	'<=>' => \&NumCmp,
+	'bool' => \&Bool,
+	'-' => \&Negate,
+	fallback => 1;
+
+=pod package pg_lsn
+
+A class to encapsulate a PostgreSQL log-sequence number (LSN) and handle conversion
+of its hex representation.
+
+Provides equality and inequality operators.
+
+Calling 'new' on undef or empty string argument returns undef, not an instance.
+
+=cut
+
+sub new_num
+{
+	my ($class, $high, $low) = @_;
+	my $self = bless { '_low' => $low, '_high' => $high } => $class;
+	$self->_constraint;
+	return $self;
+}
+
+sub new
+{
+	my ($class, $lsn_str) = @_;
+	return undef if !defined($lsn_str) || $lsn_str eq '';
+	my ($high, $low) = split('/', $lsn_str, 2);
+	die "malformed LSN" if ($high eq '' || $low eq '');
+	return $class->new_num(hex($high), hex($low));
+}
+
+sub NumCmp
+{
+	my ($self, $other, $swap) = @_;
+	$self->_constraint;
+	die "comparison with undef" unless defined($other);
+	if (!blessed($other))
+	{
+		# coerce from string if needed. Try to coerce any non-object.
+		$other = pg_lsn->new($other);
+	}
+	$other->_constraint;
+	# and compare
+	my $ret;
+	if ($self->{'_high'} < $other->{'_high'})
+	{
+		$ret = -1;
+	}
+	elsif ($self->{'_high'} == $other->{'_high'})
+	{
+		if ($self->{'_low'} < $other->{'_low'})
+		{
+			$ret = -1;
+		}
+		elsif ($self->{'_low'} == $other->{'_low'})
+		{
+			$ret = 0;
+		}
+		else
+		{
+			$ret = 1;
+		}
+	}
+	else
+	{
+		$ret = 1;
+	}
+	$ret = -$ret if $swap;
+	return $ret;
+}
+
+sub _constraint
+{
+	my $self = shift;
+	die "high word must be defined" unless (defined($self->{'_high'}));
+	die "high word must be numeric" unless (looks_like_number($self->{'_high'}));
+	die "high word must be in uint32 range" unless ($self->{'_high'} >= 0 && $self->{'_high'} <= 0xFFFFFFFF);
+	die "low word must be defined" unless (defined($self->{'_low'}));
+	die "low word must be numeric" unless (looks_like_number($self->{'_low'}));
+	die "low word must be in uint32 range" unless ($self->{'_low'} >= 0 && $self->{'_low'} <= 0xFFFFFFFF);
+}
+
+sub Bool
+{
+	my $self = shift;
+	$self->_constraint;
+	return $self->{'_high'} || $self->{'_low'};
+}
+
+sub Negate
+{
+	die "cannot negate pg_lsn";
+}
+
+sub Str
+{
+	my $self = shift;
+	return sprintf("%X/%X", $self->high, $self->low);
+}
+
+sub high
+{
+	my $self = shift;
+	return $self->{'_high'};
+}
+
+sub low
+{
+	my $self = shift;
+	return $self->{'_low'};
+}
+
+# Todo: addition/subtraction. Needs to handle wraparound and carrying.
+
+=pod parse_lsn(lsn)
+
+Returns a pg_lsn instance holding the high and low words of the passed LSN
+as numbers, or undef if the argument is the empty string or undef.
+
+=cut 
+
+sub parse_lsn
+{
+	return pg_lsn->new($_[0]);
+}
diff --git a/src/test/perl/t/001_load.pl b/src/test/perl/t/001_load.pl
new file mode 100644
index 0000000..53a39af
--- /dev/null
+++ b/src/test/perl/t/001_load.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+use Test::More tests => 5;
+
+require_ok 'RecursiveCopy';
+require_ok 'SimpleTee';
+require_ok 'TestLib';
+require_ok 'PostgresNode';
+require_ok 'pg_lsn';
diff --git a/src/test/perl/t/002_pg_lsn.pl b/src/test/perl/t/002_pg_lsn.pl
new file mode 100644
index 0000000..73e3d65
--- /dev/null
+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);
+
+use pg_lsn qw(parse_lsn);
+
+ok(!defined(parse_lsn('')), 'parse_lsn of empty string is undef');
+ok(!defined(parse_lsn(undef)), 'parse_lsn of undef is undef');
+
+my $zero_lsn = parse_lsn('0/0');
+ok(blessed($zero_lsn), 'zero lsn blessed');
+ok($zero_lsn->isa("pg_lsn"), 'zero lsn isa pg_lsn');
+is($zero_lsn->{'_high'}, 0, 'zero lsn high word zero');
+is($zero_lsn->{'_low'}, 0, 'zero lsn low word zero');
+cmp_ok($zero_lsn, "==", pg_lsn->new_num(0, 0), 'parse_lsn of 0/0');
+
+cmp_ok(parse_lsn('0/FFFFFFFF'), "==", pg_lsn->new_num(0, 0xFFFFFFFF), 'parse_lsn of 0/FFFFFFFF');
+cmp_ok(parse_lsn('FFFFFFFF/0'), "==", pg_lsn->new_num(0xFFFFFFFF, 0), 'parse_lsn of FFFFFFFF/0');
+cmp_ok(parse_lsn('FFFFFFFF/FFFFFFFF'), "==", pg_lsn->new_num(0xFFFFFFFF, 0xFFFFFFFF), 'parse_lsn of 0xFFFFFFFF/0xFFFFFFFF');
+
+is(parse_lsn('2/2') <=> parse_lsn('2/3'), -1);
+is(parse_lsn('2/2') <=> parse_lsn('2/2'), 0);
+is(parse_lsn('2/2') <=> parse_lsn('2/1'), 1);
+is(parse_lsn('2/2') <=> parse_lsn('3/2'), -1);
+is(parse_lsn('2/2') <=> parse_lsn('1/2'), 1);
+
+cmp_ok(parse_lsn('0/1'), "==", parse_lsn('0/1'));
+ok(!(parse_lsn('0/1') == parse_lsn('0/2')), "! 0/1 == 0/2");
+ok(!(parse_lsn('0/1') == parse_lsn('0/0')), "! 0/1 == 0/0");
+cmp_ok(parse_lsn('1/0'), "==", parse_lsn('1/0'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('1/1'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('2/0'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('0/0'));
+cmp_ok(parse_lsn('1/0'), "!=", parse_lsn('0/1'));
+
+cmp_ok(parse_lsn('0/1'), ">=", parse_lsn('0/1'));
+cmp_ok(parse_lsn('0/1'), "<=", parse_lsn('0/1'));
+cmp_ok(parse_lsn('0/1'), "<=", parse_lsn('0/2'));
+cmp_ok(parse_lsn('0/1'), ">=", parse_lsn('0/0'));
+cmp_ok(parse_lsn('1/0'), ">=", parse_lsn('1/0'));
+cmp_ok(parse_lsn('1/0'), "<=", parse_lsn('1/0'));
+cmp_ok(parse_lsn('1/0'), "<=", parse_lsn('2/0'));
+cmp_ok(parse_lsn('1/0'), ">=", parse_lsn('0/0'));
+cmp_ok(parse_lsn('1/1'), ">=", parse_lsn('1/1'));
+cmp_ok(parse_lsn('1/1'), "<=", parse_lsn('1/1'));
+cmp_ok(parse_lsn('1/1'), "<=", parse_lsn('1/2'));
+cmp_ok(parse_lsn('1/2'), ">=", parse_lsn('1/1'));
+
+ok(parse_lsn('1/1'), 'bool conversion');
+ok(! $zero_lsn, 'bool negation');
+
+# implicit string conversions
+cmp_ok(parse_lsn('0/0'), "==", "0/0");
+cmp_ok(parse_lsn('FFFFFFFF/FFFFFFFF'), "==", "FFFFFFFF/FFFFFFFF");
+# swapped string conversions
+cmp_ok("0/0", "==", parse_lsn('0/0'));
+cmp_ok("FFFFFFFF/FFFFFFFF", "==", parse_lsn('FFFFFFFF/FFFFFFFF'));
+
+# negation makes no sense for a uint64
+eval {
+	- parse_lsn('0/1');
+};
+if ($@) {
+	pass('negation raises error');
+} else {
+	fail('negation did not raise error');
+}
-- 
2.5.5
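
For anyone comparing pg_lsn.pm with the C side: the hi/lo text form is just a
64-bit position split into two 32-bit hex words, so the word-by-word
comparison in NumCmp gives the same ordering as comparing the combined value.
A minimal sketch of that, assuming two helper names of my own (lsn_parse and
lsn_cmp are hypothetical, they are not part of PostgreSQL or of this patch).
The "%X/%X" format is the same one pg_recvlogical uses for --startpos and
--endpos.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): parse "hi/lo" hex text into a
 * 64-bit LSN.  Returns false unless sscanf finds two hex words separated by
 * '/' with nothing after them.
 */
bool
lsn_parse(const char *str, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	char		junk;

	assert(str != NULL && lsn != NULL);

	/* The trailing %c only matches if there is leftover input. */
	if (sscanf(str, "%X/%X%c", &hi, &lo, &junk) != 2)
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * Hypothetical helper (illustration only): three-way compare, returning
 * -1, 0 or 1 like Perl's <=> operator.
 */
int
lsn_cmp(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}
```

With these, lsn_parse("2/2", &a) and lsn_parse("2/3", &b) followed by
lsn_cmp(a, b) orders the two the same way the Perl tests expect of
parse_lsn('2/2') <=> parse_lsn('2/3').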

0003-Add-an-optional-endpos-LSN-argument-to-pg_recvlogica.patch (text/x-patch; charset=UTF-8)
From 58b077b0492f81598d167a8e788a5361494c820e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 12:37:40 +0800
Subject: [PATCH 3/9] Add an optional --endpos LSN argument to pg_recvlogical
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

pg_recvlogical usually just runs until cancelled or until the upstream
server disconnects. For some purposes, especially testing, it's useful
to have the ability to stop receiving at a specified LSN without having
to parse the output and deal with buffering issues, etc.

Add a --endpos parameter that takes the LSN at which no further
messages should be written and receiving should stop.

Craig Ringer, Álvaro Herrera
---
 doc/src/sgml/ref/pg_recvlogical.sgml   |  34 ++++++++
 src/bin/pg_basebackup/pg_recvlogical.c | 145 +++++++++++++++++++++++++++++----
 2 files changed, 164 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index b35881f..d066ce8 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -38,6 +38,14 @@ PostgreSQL documentation
    constraints as <xref linkend="app-pgreceivexlog">, plus those for logical
    replication (see <xref linkend="logicaldecoding">).
   </para>
+
+  <para>
+   <command>pg_recvlogical</> has no equivalent to the logical decoding
+   SQL interface's peek and get modes. It sends replay confirmations for
+   data lazily as it receives it and on clean exit. To examine pending data on
+   a slot without consuming it, use
+   <link linkend="functions-replication"><function>pg_logical_slot_peek_changes</></>.
+  </para>
  </refsect1>
 
  <refsect1>
@@ -155,6 +163,32 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-E <replaceable>lsn</replaceable></option></term>
+      <term><option>--endpos=<replaceable>lsn</replaceable></option></term>
+      <listitem>
+       <para>
+        In <option>--start</option> mode, automatically stop replication
+        and exit with normal exit status 0 when receiving reaches the
+        specified LSN.  If specified when not in <option>--start</option>
+        mode, an error is raised.
+       </para>
+
+       <para>
+        If there's a record with LSN exactly equal to <replaceable>lsn</>,
+        the record will be output.
+       </para>
+
+       <para>
+        The <option>--endpos</option> option is not aware of transaction
+        boundaries and may truncate output partway through a transaction.
+        Any partially output transaction will not be consumed and will be
+        replayed again when the slot is next read from. Individual messages
+        are never truncated.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>--if-not-exists</option></term>
       <listitem>
        <para>
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..c700edf 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -40,6 +40,7 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
 static XLogRecPtr startpos = InvalidXLogRecPtr;
+static XLogRecPtr endpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
 static bool slot_exists_ok = false;
 static bool do_start_slot = false;
@@ -63,6 +64,9 @@ static XLogRecPtr output_fsync_lsn = InvalidXLogRecPtr;
 static void usage(void);
 static void StreamLogicalLog(void);
 static void disconnect_and_exit(int code);
+static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
+static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
+				   bool keepalive, XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -81,6 +85,7 @@ usage(void)
 			 "                         time between fsyncs to the output file (default: %d)\n"), (fsync_interval / 1000));
 	printf(_("      --if-not-exists    do not error if slot already exists when creating a slot\n"));
 	printf(_("  -I, --startpos=LSN     where in an existing slot should the streaming start\n"));
+	printf(_("  -E, --endpos=LSN       exit after receiving the specified LSN\n"));
 	printf(_("  -n, --no-loop          do not loop on connection lost\n"));
 	printf(_("  -o, --option=NAME[=VALUE]\n"
 			 "                         pass option NAME with optional value VALUE to the\n"
@@ -281,6 +286,7 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		int64		now;
 		int			hdr_len;
+		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -454,6 +460,7 @@ StreamLogicalLog(void)
 			int			pos;
 			bool		replyRequested;
 			XLogRecPtr	walEnd;
+			bool		endposReached = false;
 
 			/*
 			 * Parse the keepalive message, enclosed in the CopyData message.
@@ -476,18 +483,32 @@ StreamLogicalLog(void)
 			}
 			replyRequested = copybuf[pos];
 
-			/* If the server requested an immediate reply, send one. */
-			if (replyRequested)
+			if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
 			{
-				/* fsync data, so we send a recent flush pointer */
-				if (!OutputFsync(now))
-					goto error;
+				/*
+				 * If there's nothing to read on the socket until a keepalive
+				 * we know that the server has nothing to send us; and if
+				 * walEnd has passed endpos, we know nothing else can have
+				 * committed before endpos.  So we can bail out now.
+				 */
+				endposReached = true;
+			}
 
-				now = feGetCurrentTimestamp();
-				if (!sendFeedback(conn, now, true, false))
+			/* Send a reply, if necessary */
+			if (replyRequested || endposReached)
+			{
+				if (!flushAndSendFeedback(conn, &now))
 					goto error;
 				last_status = now;
 			}
+
+			if (endposReached)
+			{
+				prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
+				time_to_abort = true;
+				break;
+			}
+
 			continue;
 		}
 		else if (copybuf[0] != 'w')
@@ -497,7 +518,6 @@ StreamLogicalLog(void)
 			goto error;
 		}
 
-
 		/*
 		 * Read the header of the XLogData message, enclosed in the CopyData
 		 * message. We only need the WAL location field (dataStart), the rest
@@ -515,12 +535,23 @@ StreamLogicalLog(void)
 		}
 
 		/* Extract WAL location for this block */
-		{
-			XLogRecPtr	temp = fe_recvint64(&copybuf[1]);
+		cur_record_lsn = fe_recvint64(&copybuf[1]);
 
-			output_written_lsn = Max(temp, output_written_lsn);
+		if (endpos != InvalidXLogRecPtr && cur_record_lsn > endpos)
+		{
+			/*
+			 * We've read past our endpoint, so prepare to go away being
+			 * cautious about what happens to our output data.
+			 */
+			if (!flushAndSendFeedback(conn, &now))
+				goto error;
+			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			time_to_abort = true;
+			break;
 		}
 
+		output_written_lsn = Max(cur_record_lsn, output_written_lsn);
+
 		bytes_left = r - hdr_len;
 		bytes_written = 0;
 
@@ -557,10 +588,29 @@ StreamLogicalLog(void)
 					strerror(errno));
 			goto error;
 		}
+
+		if (endpos != InvalidXLogRecPtr && cur_record_lsn == endpos)
+		{
+			/* endpos was exactly the record we just processed, we're done */
+			if (!flushAndSendFeedback(conn, &now))
+				goto error;
+			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			time_to_abort = true;
+			break;
+		}
 	}
 
 	res = PQgetResult(conn);
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	if (PQresultStatus(res) == PGRES_COPY_OUT)
+	{
+		/*
+		 * We're doing a client-initiated clean exit and have sent CopyDone to
+		 * the server. We've already sent replay confirmation and fsync'd so
+		 * we can just clean up the connection now.
+		 */
+		goto error;
+	}
+	else if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		fprintf(stderr,
 				_("%s: unexpected termination of replication stream: %s"),
@@ -638,6 +688,7 @@ main(int argc, char **argv)
 		{"password", no_argument, NULL, 'W'},
 /* replication options */
 		{"startpos", required_argument, NULL, 'I'},
+		{"endpos", required_argument, NULL, 'E'},
 		{"option", required_argument, NULL, 'o'},
 		{"plugin", required_argument, NULL, 'P'},
 		{"status-interval", required_argument, NULL, 's'},
@@ -673,7 +724,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:E:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -733,6 +784,16 @@ main(int argc, char **argv)
 				}
 				startpos = ((uint64) hi) << 32 | lo;
 				break;
+			case 'E':
+				if (sscanf(optarg, "%X/%X", &hi, &lo) != 2)
+				{
+					fprintf(stderr,
+							_("%s: could not parse end position \"%s\"\n"),
+							progname, optarg);
+					exit(1);
+				}
+				endpos = ((uint64) hi) << 32 | lo;
+				break;
 			case 'o':
 				{
 					char	   *data = pg_strdup(optarg);
@@ -857,6 +918,16 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (endpos != InvalidXLogRecPtr && !do_start_slot)
+	{
+		fprintf(stderr,
+				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+				progname);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -923,8 +994,8 @@ main(int argc, char **argv)
 		if (time_to_abort)
 		{
 			/*
-			 * We've been Ctrl-C'ed. That's not an error, so exit without an
-			 * errorcode.
+			 * We've been Ctrl-C'ed or reached an exit limit condition. That's
+			 * not an error, so exit without an errorcode.
 			 */
 			disconnect_and_exit(0);
 		}
@@ -943,3 +1014,47 @@ main(int argc, char **argv)
 		}
 	}
 }
+
+/*
+ * Fsync our output data, and send a feedback message to the server.  Returns
+ * true if successful, false otherwise.
+ *
+ * If successful, *now is updated to the current timestamp just before sending
+ * feedback.
+ */
+static bool
+flushAndSendFeedback(PGconn *conn, TimestampTz *now)
+{
+	/* flush data to disk, so that we send a recent flush pointer */
+	if (!OutputFsync(*now))
+		return false;
+	*now = feGetCurrentTimestamp();
+	if (!sendFeedback(conn, *now, true, false))
+		return false;
+
+	return true;
+}
+
+/*
+ * Try to inform the server about our upcoming demise, but don't wait around or
+ * retry on failure.
+ */
+static void
+prepareToTerminate(PGconn *conn, XLogRecPtr endpos, bool keepalive, XLogRecPtr lsn)
+{
+	(void) PQputCopyEnd(conn, NULL);
+	(void) PQflush(conn);
+
+	if (verbose)
+	{
+		if (keepalive)
+			fprintf(stderr, "%s: endpos %X/%X reached by keepalive\n",
+					progname,
+					(uint32) (endpos >> 32), (uint32) endpos);
+		else
+			fprintf(stderr, "%s: endpos %X/%X reached by record at %X/%X\n",
+					progname, (uint32) (endpos >> 32), (uint32) (endpos),
+					(uint32) (lsn >> 32), (uint32) lsn);
+
+	}
+}
-- 
2.5.5
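
To summarise the stop conditions in the patch above: a record past endpos is
not written, a record exactly at endpos is written and then we stop, and a
keepalive whose walEnd has reached endpos also ends the stream. A minimal
standalone sketch of just that decision logic follows. The type, enum and
function names here are hypothetical stand-ins for illustration, not
identifiers from pg_recvlogical.c, and the flush/feedback and CopyDone steps
are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LsnPos;			/* stand-in for XLogRecPtr */
#define NO_ENDPOS ((LsnPos) 0)		/* stand-in for InvalidXLogRecPtr */

/* What to do with an XLogData message, given its LSN and --endpos. */
typedef enum
{
	REC_WRITE_CONTINUE,			/* write the record, keep streaming */
	REC_WRITE_STOP,				/* write the record, then stop */
	REC_SKIP_STOP				/* stop without writing the record */
} RecordAction;

RecordAction
endpos_record_action(LsnPos endpos, LsnPos record_lsn)
{
	/* No endpos set, or not there yet: business as usual. */
	if (endpos == NO_ENDPOS || record_lsn < endpos)
		return REC_WRITE_CONTINUE;

	/* A record exactly at endpos is still output. */
	if (record_lsn == endpos)
		return REC_WRITE_STOP;

	/* We've read past the endpoint; this record must not be written. */
	return REC_SKIP_STOP;
}

/* A keepalive ends the stream once the server's walEnd has reached endpos. */
bool
endpos_reached_by_keepalive(LsnPos endpos, LsnPos wal_end)
{
	return endpos != NO_ENDPOS && wal_end >= endpos;
}
```

Note the asymmetry: the keepalive check uses >= while the record check
separates == from >, because a record at exactly endpos still has to be
written out before exiting.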

0004-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From 0444e32dc7cacb9a83d0a04fdfc6dcfab5b65f1d Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 4/9] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 75 ++++++++++++++++++++++++++++-
 src/test/recovery/t/006_logical_decoding.pl | 31 +++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 28e9f0b..08ce4fe 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1125,7 +1125,7 @@ sub psql
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
-			  if (blessed($exc_save) || $exc_save ne $timeout_exception);
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
 
 			$ret = undef;
 
@@ -1441,6 +1441,79 @@ sub slot
 	return \%val;
 }
 
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos if ($endpos);
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index b80a9a9..d8cc8d3 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -36,5 +40,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5

0005-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From a8eb8a2b58a54d6d13b0cc83bddd7631f6d36e1f Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 5/9] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..ab15cf3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -660,6 +661,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -675,7 +677,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -702,6 +705,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -750,6 +754,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -770,28 +897,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're doing logical decoding on C and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * the old one.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 318726e..a8f7b76 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -234,13 +234,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -279,6 +279,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aa42d59..8b145e0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -758,6 +759,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -986,10 +993,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..8f96728 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -160,6 +160,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 *
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index d027ea1..f0ee352 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index a847952..d2ff1e9 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before_basebackup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5

0006-Expand-streaming-replication-tests-to-cover-hot-stan.patch (text/x-patch; charset=US-ASCII)
From 3b88dba69d4e4a1a99f0c90b7e6372328c40d645 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 9 Nov 2016 13:44:04 +0800
Subject: [PATCH 6/9] Expand streaming replication tests to cover hot standby
 feedback and physical replication slots

---
 src/test/recovery/t/001_stream_rep.pl | 105 +++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 5ce69bb..ef29892 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 4;
+use Test::More tests => 22;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -58,3 +58,106 @@ is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
 is($node_standby_2->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 2');
+
+diag "switching to physical replication slot";
+# Switch to using a physical replication slot. We can do this without a new
+# backup since physical slots can go backwards if needed. Do so on both
+# standbys. Since we're going to be testing things that affect the slot state,
+# also reduce the standby feedback interval to ensure timely updates.
+my ($slotname_1, $slotname_2) = ('standby_1', 'standby_2');
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_master->restart;
+is($node_master->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_1');]), 0, 'physical slot created on master');
+$node_standby_1->append_conf('recovery.conf', "primary_slot_name = $slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_standby_1->restart;
+is($node_standby_1->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_2');]), 0, 'physical slot created on intermediate replica');
+$node_standby_2->append_conf('recovery.conf', "primary_slot_name = $slotname_2\n");
+$node_standby_2->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_2->restart;
+
+sub get_slot_xmins
+{
+	my ($node, $slotname) = @_;
+	my $slotinfo = $node->slot($slotname);
+	return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+# There's no hot standby feedback and there are no logical slots on either peer
+# so xmin and catalog_xmin should be null on both slots.
+my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin null with no hs_feedback');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin null with no hs_feedback');
+
+# Replication still works?
+$node_master->safe_psql('postgres', 'CREATE TABLE replayed(val integer);');
+
+sub replay_check
+{
+	my $newval = $node_master->safe_psql('postgres', 'INSERT INTO replayed(val) SELECT coalesce(max(val),0) + 1 AS newval FROM replayed RETURNING val');
+	$node_master->wait_for_catchup($node_standby_1);
+	$node_standby_1->wait_for_catchup($node_standby_2);
+	$node_standby_1->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
+		or die "standby_1 didn't replay master value $newval";
+	$node_standby_2->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
+		or die "standby_2 didn't replay standby_1 value $newval";
+}
+
+replay_check();
+
+diag "enabling hot_standby_feedback";
+# Enable hs_feedback. The slot should gain an xmin. We set the status interval
+# so we'll see the results promptly.
+$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_1->reload;
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_2->reload;
+replay_check();
+sleep(2);
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+isnt($xmin, '', 'non-cascaded slot xmin non-null with hs feedback');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin still null with hs_feedback');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+isnt($xmin, '', 'cascaded slot xmin non-null with hs feedback');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin still null with hs_feedback');
+
+diag "doing some work to advance xmin";
+for my $i (10000..11000) {
+	$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES ($i);]);
+}
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my ($xmin2, $catalog_xmin2) = get_slot_xmins($node_master, $slotname_1);
+diag "new xmin $xmin2, old xmin $xmin";
+isnt($xmin2, $xmin, 'non-cascaded slot xmin with hs feedback has changed');
+is($catalog_xmin2, '', 'non-cascaded slot catalog_xmin still null with hs_feedback unchanged');
+
+($xmin2, $catalog_xmin2) = get_slot_xmins($node_standby_1, $slotname_2);
+diag "new xmin $xmin2, old xmin $xmin";
+isnt($xmin2, $xmin, 'cascaded slot xmin with hs feedback has changed');
+is($catalog_xmin2, '', 'cascaded slot catalog_xmin still null with hs_feedback unchanged');
+
+diag "disabling hot_standby_feedback";
+# Disable hs_feedback. Xmin should be cleared.
+$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_1->reload;
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_2->reload;
+replay_check();
+sleep(2);
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+is($xmin, '', 'non-cascaded slot xmin null with hs feedback reset');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin still null with hs_feedback reset');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin still null with hs_feedback reset');
-- 
2.5.5

0007-Don-t-attempt-to-export-a-snapshot-from-CREATE_REPLI.patch (text/x-patch; charset=US-ASCII)
From a52fe8a659cf9c75e35e089b2b4d1b8b108e3124 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 14:05:58 +0800
Subject: [PATCH 7/9] Don't attempt to export a snapshot from
 CREATE_REPLICATION_SLOT when in recovery

Exporting a snapshot requires us to start an xact, and we can't do that from a
server in recovery. So skip snapshot export. The client must handle syncing of
initial state via some external means like a slot on the master or manually
stopping replay from a physical copy at the same LSN.
---
 src/backend/replication/walsender.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8b145e0..abdacca 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -843,7 +843,18 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * Export a plain (not of the snapbuild.c type) snapshot to the user
 		 * that can be imported into another session.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (!RecoveryInProgress())
+		{
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		}
+		else
+		{
+			/*
+			 * Can't assign an xid during recovery so we can't export a
+			 * snapshot.
+			 */
+			snapshot_name = "";
+		}
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
-- 
2.5.5

0008-ERROR-if-timeline-is-zero-in-walsender.patch (text/x-patch; charset=US-ASCII)
From e26b11715ea219469737aec9e89580811e1030c8 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 13:50:52 +0800
Subject: [PATCH 8/9] ERROR if timeline is zero in walsender

---
 src/backend/replication/walsender.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abdacca..694e777 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -521,6 +521,11 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
+	if (ThisTimeLineID == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("run IDENTIFY_SYSTEM before trying to START_REPLICATION")));
+
 	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
-- 
2.5.5

0009-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From aed1a4e0d1357938e765758ea9695553f7e9647c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 9/9] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts. Make walsender
  exit with a recovery conflict on upstream DROP DATABASE when it has an
  active logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin and to be called with the lock
  already held.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if it is received in
  hot_standby_feedback.
* Separate the catalog_xmin used by vacuum from ProcArray's
  replication_slot_catalog_xmin, requiring that xlog be emitted before vacuum
  can remove catalog tuples that are no longer needed. Store it in
  checkpoints, and make vacuum and bgwriter advance it.
* During decoding startup, check whether the catalog_xmin requirement can be
  satisfied and bail out if it cannot.
* Add a new recovery conflict type for conflict with catalog_xmin. Abort
  in-progress logical decoding sessions with a recovery conflict where the
  needed catalog_xmin is too old.
* Make extra efforts to reserve the master's catalog_xmin during decoding
  startup on a standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove the checks preventing logical decoding from starting on a standby.
---
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  15 +
 src/backend/access/transam/xact.c                  |  55 +++
 src/backend/access/transam/xlog.c                  |  26 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |   9 +
 src/backend/postmaster/pgstat.c                    |   2 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 323 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 135 ++++--
 src/backend/storage/ipc/procarray.c                | 201 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 453 +++++++++++++++++++++
 35 files changed, 1547 insertions(+), 129 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9985e3e..4fa3ad4 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index f524fc4..5b33c97 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 50cf527..e0fd9aa 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1783,10 +1783,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0, this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1796,7 +1797,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..12c1b36 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7300,7 +7300,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..c514b7b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -810,7 +810,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 91d27d0..f454d9d 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2f7e645..f786056 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	elog(DEBUG1, "XXX advancing catalogXmin from %u to %u", ShmemVariableCache->oldestCatalogXmin, oldestCatalogXmin);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d643216..3377d3e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5641,6 +5641,61 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
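
A note on the redo branch above: the body of ResolveRecoveryConflictWithLogicalDecoding() is not shown in this part of the patch, so what follows is only a standalone sketch of the test it presumably makes. It assumes a decoding session conflicts when the catalog_xmin it needs precedes the newly replayed one. The names xid_precedes() and count_conflicting_sessions() are made up for illustration; the comparison follows the usual modulo-2^32 TransactionId ordering. It does not model the special case of an advance record carrying an invalid (0) catalog_xmin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

/*
 * Wraparound-aware "a is older than b". Special (non-normal) xids are
 * compared as plain unsigned values; normal xids are compared modulo 2^32.
 */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
        return a < b;
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: count the sessions whose needed catalog_xmin is
 * older than the catalog_xmin just replayed, i.e. the sessions that
 * would have to be signalled with a recovery conflict (assumption).
 */
static size_t
count_conflicting_sessions(const TransactionId *needed_xmins, size_t n,
                           TransactionId new_catalog_xmin)
{
    size_t conflicts = 0;

    assert(needed_xmins != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        /* A session with no catalog reservation cannot conflict. */
        if (needed_xmins[i] == InvalidTransactionId)
            continue;

        if (xid_precedes(needed_xmins[i], new_catalog_xmin))
            conflicts++;
    }

    return conflicts;
}
```

With sessions needing 100, 150 and 200, replaying an advance to 150 would flag only the session that still needs 100; the one at exactly 150 is unaffected.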
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce4f1fc..7fbc768 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4815,6 +4815,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -4827,6 +4828,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6405,6 +6407,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6421,6 +6426,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8450,6 +8456,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8653,7 +8660,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9016,7 +9023,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9204,6 +9211,16 @@ XLogReportParameters(void)
 			XLogFlush(recptr);
 		}
 
+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);
+
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
@@ -9372,6 +9389,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9470,8 +9488,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 08b646d..03976a9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2272,7 +2272,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c617abb..2170566 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -992,7 +992,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 0919ad8..3efc833 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2119,11 +2119,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 58bbf55..7c257f5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -488,6 +488,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -497,7 +506,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -909,7 +918,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a31d44e..ba69ae9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin(false);
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..7127b9f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3307,6 +3307,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..5eaf42f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1512be5..9912800 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,244 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master doesn't explicitly reply to hot standby feedback, doesn't
+	 * identify which feedback message it processed most recently, and doesn't
+	 * report the catalog_xmin it reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with standby at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * TODO: get reply from server explicitly confirming that it has applied
+	 * our hs_feedback and what the lowest catalog_xmin it can honour is.
+	 * We'll need some kind of cookie so we can tell the server is replying
+	 * to us not someone else, especially in cascading setups.
+	 */
+
+	firstWaitWalEnd = lastWaitWalEnd = WalRcv->latestWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+		XLogRecPtr ptr = GetXLogReplayRecPtr(NULL);
+
+		elog(DEBUG1, "XXX firstEnd %X/%X, lastEnd %X/%X; ptr %X/%X; oldestCatalogXmin %u",
+			(uint32)(firstWaitWalEnd>>32), (uint32)(firstWaitWalEnd),
+			(uint32)(lastWaitWalEnd>>32), (uint32)(lastWaitWalEnd),
+			(uint32)(ptr>>32), (uint32)(ptr),
+			ShmemVariableCache->oldestCatalogXmin);
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our oldestCatalogXmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has already removed it.
+		 * We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
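
The exit condition of the wait loop in WaitForMasterCatalogXminReservation() can be read as a pure predicate. Below is a minimal sketch, with XLogRecPtr and TransactionId modelled as plain integers and the hypothetical name must_keep_waiting(). It only restates the three tests in the while condition and does none of the latch, locking or feedback work.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical restatement of the loop condition. Slot creation keeps
 * waiting while any of these hold:
 *   - the sender has not reported a new WAL end since we forced feedback,
 *   - replay has not yet reached that reported WAL end,
 *   - no catalog_xmin has been replayed from the master yet.
 */
static bool
must_keep_waiting(XLogRecPtr first_wal_end, XLogRecPtr last_wal_end,
                  XLogRecPtr replay_ptr, TransactionId oldest_catalog_xmin)
{
    if (last_wal_end == first_wal_end)
        return true;            /* no fresh report from the walsender */

    if (replay_ptr < last_wal_end)
        return true;            /* redo is still behind the report */

    if (oldest_catalog_xmin == InvalidTransactionId)
        return true;            /* master has not logged a catalog_xmin */

    return false;
}
```

Only when all three tests pass does the function go on to compare the slot's catalog_xmin with oldestCatalogXmin and settle on a starting point.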
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 0b2575e..35920cd 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -758,6 +758,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -805,7 +892,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
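
For readers skimming the slot.c changes: the selection logic in ReplicationSlotsDropDBSlots() boils down to three skips and one sanity check. The sketch below models it on a plain array, under the assumption that a slot is logical exactly when its database oid is valid. SlotModel and drop_db_slots() are illustrative names only, and where the patch PANICs on an in-use slot this model just returns -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

#define InvalidOid ((Oid) 0)

/* Simplified stand-in for a replication slot (illustrative only). */
typedef struct SlotModel
{
    bool in_use;        /* slot array entry is allocated */
    Oid  database;      /* InvalidOid for physical slots */
    int  active_pid;    /* 0 when no backend has acquired the slot */
} SlotModel;

/*
 * Drop every logical slot belonging to dboid. Returns the number of
 * slots dropped, or -1 as soon as a matching slot turns out to be in
 * use (slots dropped before that point stay dropped).
 */
static int
drop_db_slots(SlotModel *slots, size_t nslots, Oid dboid)
{
    int dropped = 0;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        SlotModel *s = &slots[i];

        if (!s->in_use)
            continue;           /* free array entry */
        if (s->database == InvalidOid)
            continue;           /* physical slots are not db-specific */
        if (s->database != dboid)
            continue;           /* belongs to another database */

        if (s->active_pid != 0)
            return -1;          /* should be impossible under the db lock */

        s->in_use = false;      /* "drop" the slot */
        dropped++;
    }

    return dropped;
}
```

The real function additionally releases and re-takes ReplicationSlotControlLock around each drop and recomputes the required xmin and LSN once at the end, which this model leaves out.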
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2bb3dce..c887523 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -499,9 +499,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If logical decoding information is being logged, we also
+						 * send hot standby feedback immediately to reduce the
+						 * delay before the catalogs we need are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1164,8 +1170,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	static bool master_has_standby_xmin = false;
 
@@ -1206,29 +1212,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
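
The walreceiver.c change above extends the 'h' feedback message from 17 to 25 bytes by appending catalog_xmin and its epoch. As a rough standalone sketch of that layout (not code from the patch): HSFeedback, epoch_for_xid, put_be and pack_hs_feedback are hypothetical names, and the big-endian packing is assumed to match the network byte order that pq_sendint() and pq_sendint64() produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical container for the fields of the 'h' feedback message. */
typedef struct HSFeedback
{
	int64_t		send_time;
	TransactionId xmin;
	uint32_t	xmin_epoch;
	TransactionId catalog_xmin;
	uint32_t	catalog_xmin_epoch;
} HSFeedback;

/*
 * An xid numerically above next_xid is taken to belong to the previous
 * epoch. This is the same plain unsigned comparison the patch uses.
 */
static uint32_t
epoch_for_xid(TransactionId xid, TransactionId next_xid, uint32_t next_epoch)
{
	return (next_xid < xid) ? next_epoch - 1 : next_epoch;
}

/* Append an nbytes-wide big-endian integer at buf[off]; return new offset. */
static size_t
put_be(uint8_t *buf, size_t off, uint64_t value, int nbytes)
{
	assert(nbytes >= 1 && nbytes <= 8);
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (uint8_t) (value >> (8 * i));
	return off;
}

/* Pack the message; buf must have room for at least 25 bytes. */
static size_t
pack_hs_feedback(uint8_t *buf, const HSFeedback *fb)
{
	size_t		off = 0;

	buf[off++] = 'h';			/* message type byte */
	off = put_be(buf, off, (uint64_t) fb->send_time, 8);
	off = put_be(buf, off, fb->xmin, 4);
	off = put_be(buf, off, fb->xmin_epoch, 4);
	off = put_be(buf, off, fb->catalog_xmin, 4);	/* new in this patch */
	off = put_be(buf, off, fb->catalog_xmin_epoch, 4);	/* new in this patch */
	return off;
}
```

A caller would fill both epochs with epoch_for_xid() from a single (nextXid, epoch) reading, as the patch does with one GetNextXidAndEpoch() call, and then send the packed bytes.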
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 694e777..4b63af9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,7 +188,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -217,6 +216,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1554,6 +1554,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * Unlike LogicalConfirmReceivedLocation, we don't need to do any
+	 * effective_xmin / effective_catalog_xmin testing here either. If we
+	 * received the xmin and catalog_xmin from downstream replication slots,
+	 * we know they were already confirmed there.
 	 */
 }
 
@@ -1616,7 +1621,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1637,6 +1642,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1647,59 +1668,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1724,15 +1778,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so when not using a slot
+	 * we store the older of the two values provided.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2605,17 +2667,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2649,7 +2700,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5d487d..ecde732 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1298,17 +1298,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1382,9 +1387,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1433,19 +1442,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.
+	 *
+	 * XXX TODO make sure we zero the checkpointed value when we turn logical decoding
+	 * off, and check it during startup!!
+	 */
+	if (!XLogLogicalInfoActive() && !force)
+	{
+		Assert(!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin));
+		Assert(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin));
+	}
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2173,14 +2256,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on a replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2662,6 +2751,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2936,18 +3072,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
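
The procarray.c changes split catalog_xmin into the value slots need and the value vacuum actually honours. The following standalone sketch restates the two decisions involved, under the assumption that the comparisons behave like TransactionIdPrecedes() and NormalTransactionIdPrecedes(); catalog_xmin_needs_update and clamp_to_catalog_xmin are hypothetical names, and the locking and WAL logging around them are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)
#define xid_is_valid(xid) ((xid) != InvalidTransactionId)
#define xid_is_normal(xid) ((xid) >= FirstNormalTransactionId)

/* Circular comparison for normal xids; plain comparison for special ones. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * Mirrors CatalogXminNeedsUpdate(): update when the slots' value has
 * advanced past vacuum's value, or when exactly one of the two is set.
 */
static bool
catalog_xmin_needs_update(TransactionId vacuum_catalog_xmin,
						  TransactionId slots_catalog_xmin)
{
	return xid_precedes(vacuum_catalog_xmin, slots_catalog_xmin) ||
		(xid_is_valid(vacuum_catalog_xmin) != xid_is_valid(slots_catalog_xmin));
}

/*
 * Mirrors the tail of GetOldestXmin(): back the result up to the effective
 * catalog_xmin unless the caller asked to ignore it or the relation is not
 * one that logical decoding reads.
 */
static TransactionId
clamp_to_catalog_xmin(TransactionId result, TransactionId catalog_xmin,
					  bool ignore_catalog_xmin, bool global_or_catalog_rel)
{
	assert(xid_is_normal(result));

	if (!ignore_catalog_xmin && global_or_catalog_rel &&
		xid_is_valid(catalog_xmin) &&
		(int32_t) (catalog_xmin - result) < 0)
		return catalog_xmin;
	return result;
}
```

Note that the update predicate never fires for a value that would move vacuum's threshold backwards while both values are set, which matches the one-directional advance the patch relies on.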
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index a3d6ac5..d17dba1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..a0a051b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1105,3 +1108,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock. No
+			 * new conflicts can appear "behind" our scan of the replication_slots
+			 * array because sessions check the oldestCatalogXmin on decoding
+			 * startup. Releasing the lock lets the exiting backend clear the
+			 * slot's active_pid.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/*
+				 * As a safeguard against signalling the wrong process in case of
+				 * pid reassignment, check that the slot's active_pid hasn't been
+				 * cleared or changed. Do an unlocked read here since the worst
+				 * wrong outcome even in the case of a garbage read is an extra
+				 * sleep. If you get a new backend with the same pid in the
+				 * same slot array position you have terrible luck, and it
+				 * might get cancelled with a spurious conflict.
+				 */
+				 */
+				if (active_pid != slot->active_pid)
+					continue;
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("Pid %d requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
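
The loop condition in ResolveRecoveryConflictWithLogicalDecoding() packs several rules into one expression. As a hedged restatement, the sketch below pulls it out into a pure predicate. SlotView and slot_conflicts are hypothetical names, next_xid stands in for ShmemVariableCache->nextXid, and the comparison helper is assumed to behave like TransactionIdPrecedes(); none of the locking, waiting or signalling is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)
#define xid_is_valid(xid) ((xid) != InvalidTransactionId)
#define xid_is_normal(xid) ((xid) >= FirstNormalTransactionId)

/* Hypothetical snapshot of the slot fields the check reads. */
typedef struct SlotView
{
	bool		in_use;
	int			active_pid;		/* 0 when no session holds the slot */
	TransactionId effective_catalog_xmin;
} SlotView;

/* Circular comparison for normal xids; plain comparison for special ones. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * True if an active session on this slot still needs catalog rows that the
 * master has allowed to be removed. A reservation at or beyond next_xid is
 * treated as a slot still being created and is not a conflict.
 */
static bool
slot_conflicts(const SlotView *slot, TransactionId new_catalog_xmin,
			   TransactionId next_xid)
{
	assert(xid_is_normal(next_xid));

	return slot->in_use &&
		slot->active_pid != 0 &&
		xid_is_valid(slot->effective_catalog_xmin) &&
		(!xid_is_valid(new_catalog_xmin) ||
		 xid_precedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
		xid_precedes(slot->effective_catalog_xmin, next_xid);
}
```

Written this way, the case the function comment worries about is visible: when the master reports InvalidTransactionId, every active slot conflicts except one whose tentative reservation has not yet fallen behind next_xid.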
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index cc84754..a6baa33 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2262,6 +2262,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2698,8 +2701,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2781,6 +2788,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2795,12 +2803,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2855,11 +2864,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3bad417 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -242,6 +242,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 969eff9..50f68e8 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index a123d2a..17e4306 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 0bc41ab..df19adc 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -43,6 +43,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0b85b7a..d7817d4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -746,7 +746,8 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index e00562d..4ad2bcf 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -175,6 +175,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cd787c9..5ba4ae8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index dd37c0c..0592aff 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -78,6 +78,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -86,6 +88,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(bool force);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index f67b982..8e37e29 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index dcebf72..cc04186 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..9082ddd
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,453 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use pg_lsn qw(parse_lsn);
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 4\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', "log_error_verbosity = verbose\n");
+$node_master->append_conf('postgresql.conf', "hot_standby_feedback = on\n");
+# send status rapidly so we promptly advance xmin on master
+$node_master->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+# very promptly terminate conflicting backends
+$node_master->append_conf('postgresql.conf', "max_standby_streaming_delay = '2s'\n");
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica);
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery, but only
+# if there's an active xact when decoding. Logical decoding
+# doesn't keep a virtualxid while waiting for WAL, only when calling output
+# plugins, so this won't work damn.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica);
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica);
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica);
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica);
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#12Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Craig Ringer (#11)
Re: Logical decoding on standby

On 07/12/16 07:05, Craig Ringer wrote:

On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:

Hi all

I've prepared a working initial, somewhat raw implementation for
logical decoding on physical standbys.

Hi all

I've attached a significantly revised patch, which now incorporates
safeguards to ensure that we prevent decoding if the master has not
retained needed catalogs and cancel decoding sessions that are holding
up apply because they need too-old catalogs.

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field. Vacuum notices if
procArray->replication_slot_catalog_xmin has advanced past
ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
record with the new value before it copies it to oldestCatalogXmin.
This means that a standby can now reliably tell when catalogs are
about to be removed or become candidates for removal, so it can pause
redo until logical decoding sessions on the standby advance far enough
that their catalog_xmin passes that point. It also means that if our
hot_standby_feedback somehow fails to lock in the catalogs our slots
need on a standby, we can cancel sessions with a conflict with
recovery.

If wal_level is < logical this won't do anything, since
replication_slot_catalog_xmin and oldestCatalogXmin will both always
be 0.

Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
new replication_slot_catalog_xmin, this won't impact catalog bloat.

Ideally this mechanism won't generally actually be needed, since
hot_standby_feedback stops the master from removing needed catalogs,
and we make an effort to ensure that the standby has
hot_standby_feedback enabled and is using a replication slot. We
cannot prevent the user from dropping and re-creating the physical
slot on the upstream, though, and it doesn't look simple to stop them
turning off hot_standby_feedback or turning off use of a physical slot
after creating logical slots, either. So we try to stop users shooting
themselves in the foot, but if they do it anyway we notice and cope
gracefully. Logging catalog_xmin also helps slots created on standbys
know where to start, and makes sure we can deal gracefully with a race
between hs_feedback and slot creation on a standby.

Hi,

If this mechanism would not be needed most of the time, wouldn't it be
better not to have it, and instead just have a way to ask the physical
slot what its currently reserved catalog_xmin is? (In most cases the
standby should already know, since it is the one sending the hs_feedback,
but the first slot created on such a standby may need to check.) WRT
preventing hs_feedback going off, we can refuse to start with hs_feedback
off when logical slots are detected. We can also refuse to connect to the
master without a physical slot if logical slots are detected. I
don't see a problem with either of those.

You may ask what if user drops the slot and recreates or somehow
otherwise messes up catalog_xmin on master, well, considering that under
my proposal we'd first (on connect) check the slot for catalog_xmin we'd
know about it so we'd either mark the local slots broken/drop them or
plainly refuse to connect to the master same way as if it didn't have
required WAL anymore (not sure which behavior is better). Note that user
could mess up catalog_xmin even in your design the same way, so it's not
really a regression.

In general I don't think that it's necessary to WAL log anything for
this. It will not work without slot and therefore via archiving anyway
so writing to WAL does not seem to buy us anything. There are some
interesting side effects of cascading (ie having A->B->C replication and
creating logical slot on C) but those should not be insurmountable. Plus
it might even be okay to only allow creating logical slots on standbys
connected directly to master in v1.

That's about approach, but since there are prerequisite patches in the
patchset that don't really depend on the approach I will comment about
them as well.

0001 and 0002 add testing infrastructure and look fine to me, possibly
committable.

But in 0003 I don't understand following code:

+	if (endpos != InvalidXLogRecPtr && !do_start_slot)
+	{
+		fprintf(stderr,
+				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+				progname);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}

Why is --create-slot and --endpos not allowed together?

0004 again looks good but depends on 0003.

0005 is timeline following which is IMHO ready for committer, as is 0006
and 0008 and I still maintain the opinion that these should go in soon.

0007 is unfinished as you said in your mail (missing option to specify
behavior). And the last one 0009 is the implementation discussed above,
which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.

I think parts of this could be committed separately and are ready for
committer IMHO, but there is no way in CF application to mark only part
of patch-set for committer to attract the attention.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Craig Ringer
craig@2ndquadrant.com
In reply to: Petr Jelinek (#12)
Re: Logical decoding on standby

On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field.
[snip]

If this mechanism would not be needed most of the time, wouldn't it be
better to not have it and just have a way to ask physical slot about
what's the current reserved catalog_xmin (in most cases the standby
should actually know what it is since it's sending the hs_feedback, but
first slot created on such standby may need to check).

Yes, and that was actually my originally preferred approach, though
the one above does offer the advantage that if something goes wrong we
can detect it and fail gracefully. Possibly not worth the complexity
though.

Your approach requires us to make very sure that hot_standby_feedback
does not get turned off by user or become ineffective once we're
replicating, though, since we won't have any way to detect when needed
tuples are removed. We'd probably just bail out with relcache/syscache
lookup errors, but I can't guarantee we wouldn't crash if we tried
logical decoding on WAL where needed catalogs have been removed.

I initially ran into trouble doing that, but now think I have a way forward.

WRT preventing
hs_feedback going off, we can refuse to start with hs_feedback off when
there are logical slots detected.

Yes. There are some ordering issues there though. We load slots quite
late in startup and they don't get tracked in checkpoints. So we might
launch the walreceiver before we load slots and notice their needed
xmin/catalog_xmin. So we need to prevent sending of
hot_standby_feedback until slots are loaded, or load slots earlier in
startup. The former sounds less intrusive and safer - probably just
add an "initialized" flag to ReplicationSlotCtlData and suppress
hs_feedback until it becomes true.
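
A rough sketch of that "initialized" flag idea in C, under the assumption that the walreceiver consults shared slot state before each feedback message. SlotCtlSketch is a hypothetical stand-in for the relevant part of ReplicationSlotCtlData, and may_send_hs_feedback is an invented name, not an existing function.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of ReplicationSlotCtlData. */
typedef struct SlotCtlSketch
{
    bool initialized;   /* set once the on-disk slots have been loaded */
} SlotCtlSketch;

/*
 * Decide whether the walreceiver may send hot_standby_feedback yet.
 * Until slots are loaded we do not know their xmin/catalog_xmin, so any
 * feedback sent would not account for them.
 */
bool
may_send_hs_feedback(const SlotCtlSketch *ctl, bool hot_standby_feedback)
{
    if (!hot_standby_feedback)
        return false;
    return ctl->initialized;
}
```

The point of the design is that slot loading itself stays where it is in startup; only the sending of feedback is delayed.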

We'd also have to suppress the validation callback action on the
hot_standby_feedback GUC until we know replication slot state is
initialised, and perform the check during slot startup instead. The
hot_standby_feedback GUC validation check might get called before
shmem is even set up so we have to guard against attempts to access a
shmem segment that may not even exist yet.

The general idea is workable though. Refuse to start if logical slots
exist and hot_standby_feedback is off or walreceiver isn't using a
physical slot. Refuse to allow hot_standby_feedback to change

We can also refuse to connect to the
master without physical slot if there are logical slots detected. I
don't see problem with either of those.

Agreed. We must also be able to reliably enforce that the walreceiver
is using a replication slot to connect to the master and refuse to
start if it is not. The user could change recovery.conf and restart
the walreceiver while we're running, after we perform that check, so
walreceiver must also refuse to start if logical replication slots
exist but it has no primary slot name configured.
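
The two refusal conditions discussed here can be written down as a single check. This is only a sketch under stated assumptions: the function name, its arguments and the message texts are made up for illustration, and where the real check would run (postmaster startup, walreceiver start, or both) is left open.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical prerequisite check. Returns NULL if it is safe to proceed,
 * otherwise a message explaining why startup (or walreceiver start) should
 * be refused.
 */
const char *
logical_slot_prereq_error(int n_logical_slots,
                          bool hot_standby_feedback,
                          const char *primary_slot_name)
{
    /* With no logical slots there is nothing to protect. */
    if (n_logical_slots == 0)
        return NULL;

    if (!hot_standby_feedback)
        return "logical slots exist but hot_standby_feedback is off";

    if (primary_slot_name == NULL || primary_slot_name[0] == '\0')
        return "logical slots exist but no primary_slot_name is configured";

    return NULL;
}
```

Calling the same function from the walreceiver as well as at startup would cover the case where recovery.conf is changed after the first check.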

You may ask what if user drops the slot and recreates or somehow
otherwise messes up catalog_xmin on master, well, considering that under
my proposal we'd first (on connect) check the slot for catalog_xmin we'd
know about it so we'd either mark the local slots broken/drop them or
plainly refuse to connect to the master same way as if it didn't have
required WAL anymore (not sure which behavior is better). Note that user
could mess up catalog_xmin even in your design the same way, so it's not
really a regression.

Agreed. Checking catalog_xmin of the slot when we connect is
sufficient to guard against that, assuming we can trust that the
catalog_xmin is actually in effect on the master. Consider cascading
setups, where we set our catalog_xmin but it might not be "locked in"
until the middle cascaded server relays it to the master.

I have a proposed solution to that which I'll outline in a separate
patch+post; it ties in to some work on addressing the race between hot
standby feedback taking effect and queries starting on the hot
standby. It boils down to "add a hot_standby_feedback reply protocol
message".

Plus
it might even be okay to only allow creating logical slots on standbys
connected directly to master in v1.

True. I didn't consider that.

We haven't had much luck in the past with such limitations, but
personally I'd consider it a perfectly reasonable one.

But in 0003 I don't understand following code:

+     if (endpos != InvalidXLogRecPtr && !do_start_slot)
+     {
+             fprintf(stderr,
+                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+                             progname);
+             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+                             progname);
+             exit(1);
+     }

Why is --create-slot and --endpos not allowed together?

What would --create-slot with --endpos do?

Ah. I had not realised that it is legal to do

pg_recvlogical -S test --create-slot --start -f - -d 'test'

i.e. "create a slot then in the same invocation begin decoding from it".

I misread and thought that --create-slot and --start were mutually exclusive.

I will address that.

0005 is timeline following which is IMHO ready for committer, as is 0006
and 0008 and I still maintain the opinion that these should go in soon.

I wonder if I should re-order 0005 and 0006 so we can commit the
hot_standby test improvements before logical decoding timeline
following.

I think parts of this could be committed separately and are ready for
committer IMHO, but there is no way in CF application to mark only part
of patch-set for committer to attract the attention.

Yeah. I raised that before and nobody was really sure what to do about
it. It's confusing to post patches on the same thread on separate CF
entries. It's also confusing to post patches on a nest of
inter-related threads to allow each thread to be tracked by a separate
CF entry.

At the moment I'm aiming to progressively get the underlying
infrastructure/test stuff in so we can focus on the core feature.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14Craig Ringer
craig@2ndquadrant.com
In reply to: Petr Jelinek (#12)
Re: Logical decoding on standby

On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

But in 0003 I don't understand following code:

+     if (endpos != InvalidXLogRecPtr && !do_start_slot)
+     {
+             fprintf(stderr,
+                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+                             progname);
+             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+                             progname);
+             exit(1);
+     }

Why is --create-slot and --endpos not allowed together?

Actually, the test is fine, the error is just misleading due to my
misunderstanding.

The fix is simply to change the error message to

_("%s: --endpos may only be specified
with --start\n"),

so I won't post a separate followup patch.

Okano Naoki tried to bring this to my attention earlier, but I didn't
understand as I hadn't yet realised you could use --create-slot
--start, they weren't mutually exclusive.

(We test to ensure --start --drop-slot isn't specified earlier so no
test for do_drop_slot is required).
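
For reference, the corrected check boils down to the following standalone sketch. The condition and message text are taken from the mails above; the function wrapper, INVALID_LSN and the return-a-message convention are assumptions made so the logic can be read in isolation (the real code prints to stderr and exits).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_LSN ((uint64_t) 0)  /* stand-in for InvalidXLogRecPtr */

/*
 * Returns NULL if the option combination is acceptable, otherwise the
 * error text. --create-slot together with --start and --endpos passes,
 * because do_start_slot is true in that case.
 */
const char *
endpos_option_error(uint64_t endpos, bool do_start_slot)
{
    if (endpos != INVALID_LSN && !do_start_slot)
        return "--endpos may only be specified with --start";
    return NULL;
}
```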

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Craig Ringer (#14)
Re: Logical decoding on standby

On 21/12/16 04:31, Craig Ringer wrote:

On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

But in 0003 I don't understand following code:

+     if (endpos != InvalidXLogRecPtr && !do_start_slot)
+     {
+             fprintf(stderr,
+                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+                             progname);
+             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+                             progname);
+             exit(1);
+     }

Why is --create-slot and --endpos not allowed together?

Actually, the test is fine, the error is just misleading due to my
misunderstanding.

The fix is simply to change the error message to

_("%s: --endpos may only be specified
with --start\n"),

so I won't post a separate followup patch.

Ah okay, makes sense. --create-slot + --endpos should definitely be
an allowed combination, especially now that we can extend this to
optionally use a temporary slot.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#16Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Craig Ringer (#13)
Re: Logical decoding on standby

On 21/12/16 04:06, Craig Ringer wrote:

On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field.
[snip]

If this mechanism would not be needed most of the time, wouldn't it be
better to not have it and just have a way to ask physical slot about
what's the current reserved catalog_xmin (in most cases the standby
should actually know what it is since it's sending the hs_feedback, but
first slot created on such standby may need to check).

Yes, and that was actually my originally preferred approach, though
the one above does offer the advantage that if something goes wrong we
can detect it and fail gracefully. Possibly not worth the complexity
though.

Your approach requires us to make very sure that hot_standby_feedback
does not get turned off by user or become ineffective once we're
replicating, though, since we won't have any way to detect when needed
tuples are removed. We'd probably just bail out with relcache/syscache
lookup errors, but I can't guarantee we wouldn't crash if we tried
logical decoding on WAL where needed catalogs have been removed.

I initially ran into trouble doing that, but now think I have a way forward.

WRT preventing
hs_feedback going off, we can refuse to start with hs_feedback off when
there are logical slots detected.

Yes. There are some ordering issues there though. We load slots quite
late in startup and they don't get tracked in checkpoints. So we might
launch the walreceiver before we load slots and notice their needed
xmin/catalog_xmin. So we need to prevent sending of
hot_standby_feedback until slots are loaded, or load slots earlier in
startup. The former sounds less intrusive and safer - probably just
add an "initialized" flag to ReplicationSlotCtlData and suppress
hs_feedback until it becomes true.

We'd also have to suppress the validation callback action on the
hot_standby_feedback GUC until we know replication slot state is
initialised, and perform the check during slot startup instead. The
hot_standby_feedback GUC validation check might get called before
shmem is even set up so we have to guard against attempts to access a
shmem segment that may not even exist yet.

The general idea is workable though. Refuse to start if logical slots
exist and hot_standby_feedback is off or walreceiver isn't using a
physical slot. Refuse to allow hot_standby_feedback to change

These are all problems associated with replication slots being in shmem
if I understand correctly. I wonder, could we put just bool someplace
which is available early that says if there are any logical slots
defined? We don't actually need all the slot info, just to know if there
are some.

You may ask what if user drops the slot and recreates or somehow
otherwise messes up catalog_xmin on master, well, considering that under
my proposal we'd first (on connect) check the slot for catalog_xmin we'd
know about it so we'd either mark the local slots broken/drop them or
plainly refuse to connect to the master same way as if it didn't have
required WAL anymore (not sure which behavior is better). Note that user
could mess up catalog_xmin even in your design the same way, so it's not
really a regression.

Agreed. Checking catalog_xmin of the slot when we connect is
sufficient to guard against that, assuming we can trust that the
catalog_xmin is actually in effect on the master. Consider cascading
setups, where we set our catalog_xmin but it might not be "locked in"
until the middle cascaded server relays it to the master.

I have a proposed solution to that which I'll outline in a separate
patch+post; it ties in to some work on addressing the race between hot
standby feedback taking effect and queries starting on the hot
standby. It boils down to "add a hot_standby_feedback reply protocol
message".

Plus
it might even be okay to only allow creating logical slots on standbys
connected directly to master in v1.

True. I didn't consider that.

We haven't had much luck in the past with such limitations, but
personally I'd consider it a perfectly reasonable one.

I think it's infinitely better with that limitation than the status quo.
Especially for failover scenario (you usually won't failover to replica
down the cascade as it's always more behind). Not to mention that with
every level of cascading you get automatically more lag which means more
bloat so it might not even be all that desirable to go that route
immediately in v1 when we don't have way to control that bloat/maximum
slot lag.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#13)
Re: Logical decoding on standby

On Tue, Dec 20, 2016 at 10:06 PM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field.
[snip]

If this mechanism would not be needed most of the time, wouldn't it be
better to not have it and just have a way to ask physical slot about
what's the current reserved catalog_xmin (in most cases the standby
should actually know what it is since it's sending the hs_feedback, but
first slot created on such standby may need to check).

Yes, and that was actually my originally preferred approach, though
the one above does offer the advantage that if something goes wrong we
can detect it and fail gracefully. Possibly not worth the complexity
though.

Your approach requires us to make very sure that hot_standby_feedback
does not get turned off by user or become ineffective once we're
replicating, though, since we won't have any way to detect when needed
tuples are removed. We'd probably just bail out with relcache/syscache
lookup errors, but I can't guarantee we wouldn't crash if we tried
logical decoding on WAL where needed catalogs have been removed.

I dunno, Craig, I think your approach sounds more robust. It's not
very nice to introduce a bunch of random prohibitions on what works
with what, and it doesn't sound like it's altogether watertight
anyway. Incorporating an occasional, small record into the WAL stream
to mark the advancement of the reserved catalog_xmin seems like a
cleaner and safer solution. We certainly do NOT want to find out
about corruption only because of random relcache/syscache lookup
failures, let alone crashes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Petr Jelinek (#12)
Re: Logical decoding on standby

On Tue, Dec 20, 2016 at 4:03 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

That's about approach, but since there are prerequisite patches in the
patchset that don't really depend on the approach I will comment about
them as well.

0001 and 0002 add testing infrastructure and look fine to me, possibly
committable.

But in 0003 I don't understand following code:

+     if (endpos != InvalidXLogRecPtr && !do_start_slot)
+     {
+             fprintf(stderr,
+                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
+                             progname);
+             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+                             progname);
+             exit(1);
+     }

Why is --create-slot and --endpos not allowed together?

0004 again looks good but depends on 0003.

0005 is timeline following which is IMHO ready for committer, as is 0006
and 0008 and I still maintain the opinion that these should go in soon.

0007 is unfinished as you said in your mail (missing option to specify
behavior). And the last one 0009 is the implementation discussed above,
which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.

I think parts of this could be committed separately and are ready for
committer IMHO, but there is no way in CF application to mark only part
of patch-set for committer to attract the attention.

Craig has pinged me about looking at 0001, 0002, 0004 and 0006 as
those involve the TAP infrastructure.

So, for 0001:
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
+use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);
This depends on 0002, so the order should be reversed.
+sub lsn
+{
+   my $self = shift;
+   return $self->safe_psql('postgres', 'select case when
pg_is_in_recovery() then pg_last_xlog_replay_location() else
pg_current_xlog_insert_location() end as lsn;');
+}
The design of the test should be in charge of choosing which value it
wants to get, and the routine should just blindly do the work. More
flexibility is more useful to design tests. So it would be nice to
have one routine able to switch at will between 'flush', 'insert',
'write', 'receive' and 'replay' modes to get values from the
corresponding xlog functions.
-       die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+       die "error running SQL: '$$stderr'\nwhile running
'@psql_params' with sql '$sql'"
          if $ret == 3;
That's separate from this patch, and definitely useful.
+   if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+       die "valid modes are restart, confirmed_flush";
+   }
+   if (!defined($target_lsn)) {
+       $target_lsn = $self->lsn;
+   }
That's not really the style followed by the perl scripts present in
the code regarding the use of the brackets. Do we really need to care
about the object type checks by the way?

Regarding wait_for_catchup, there are two ways to do things. Either
query the standby like in the way 004_timeline_switch.pl does it or
the way this routine does. The way of this routine looks more
straight-forward IMO, and other tests should be refactored to use it.
In short I would make the target LSN a mandatory argument, and have
the caller send a standby's application_name instead of a PostgresNode
object, the current way to enforce the value of $standby_name being
really confusing.

+ my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
'replay' => 1 );
What's actually the point of 'sent'?

+   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
@fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
'$slot_name'");
+   $result = undef if $result eq '';
+   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
Couldn't this portion be made more generic? Other queries could
benefit from that by having a routine that accepts as argument an
array of column names for example.

Now looking at 0002....
One whitespace:
$ git diff HEAD~1 --check
src/test/perl/pg_lsn.pm:139: trailing whitespace.
+=cut

pg_lsn sounds like a fine name, now we are more used to camel case for
module names. And routines are written as lower case separated by an
underscore.

+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);
Most of the tests added don't have a description. This makes things
harder to debug when tracking an issue.

It may be good to begin using this module within the other tests in
this patch as well. Now do we actually need it? Most of the existing
tests I recall rely on the backend's operators for the pg_lsn data
type, so this is actually duplicating an existing facility. And all the
values are just passed as-is.

+++ b/src/test/perl/t/001_load.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+use Test::More tests => 5;
I can guess the meaning of this test, having a comment on top of it to
explain the purpose of the test is good practice though.

Looking at 0004...
+Disallows pg_recvlogial from internally retrying on error by passing --no-loop.
s/pg_recvlogial/pg_recvlogical

+sub pg_recvlogical_upto
+{
This looks like a good idea for your tests.
+my $endpos = $node_master->safe_psql('postgres', "SELECT location
FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY
location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
On the same wave as the pg_recvlogical wrapper, you may want to
consider some kind of wrapper at SQL level calling the slot functions.
And finally 0006.
+$node_standby_1->append_conf('recovery.conf', "primary_slot_name =
$slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf',
"wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
No need to call this routine multiple times.

Increasing the test coverage is definitely worth it.
--
Michael

#19Craig Ringer
craig@2ndquadrant.com
In reply to: Michael Paquier (#18)
Re: Logical decoding on standby

On 22 December 2016 at 13:43, Michael Paquier <michael.paquier@gmail.com> wrote:

So, for 0001:
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
use Socket;
use Test::More;
use TestLib ();
+use pg_lsn qw(parse_lsn);
use Scalar::Util qw(blessed);
This depends on 0002, so the order should be reversed.

Will do. That was silly.

I think I should probably also move the standby tests earlier, then
add a patch to update them when the results change.

+sub lsn
+{
+   my $self = shift;
+   return $self->safe_psql('postgres', 'select case when
pg_is_in_recovery() then pg_last_xlog_replay_location() else
pg_current_xlog_insert_location() end as lsn;');
+}
The design of the test should be in charge of choosing which value it
wants to get, and the routine should just blindly do the work. More
flexibility is more useful to design tests. So it would be nice to
have one routine able to switch at will between 'flush', 'insert',
'write', 'receive' and 'replay' modes to get values from the
corresponding xlog functions.

Fair enough. I can amend that.
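
One possible shape for such a mode switch, sketched here in C rather than the Perl of PostgresNode.pm: a plain lookup from mode name to the SQL function to call. The helper name is hypothetical; the SQL function names are the xlog location functions as they are named at the time of this thread.

```c
#include <stddef.h>
#include <string.h>

/*
 * Map an LSN "mode" to the SQL function that reports it.
 * Returns NULL for an unknown mode so the caller can die with a
 * "valid modes are ..." message.
 */
const char *
lsn_function_for_mode(const char *mode)
{
    static const struct
    {
        const char *mode;
        const char *func;
    } modes[] = {
        {"insert",  "pg_current_xlog_insert_location()"},
        {"flush",   "pg_current_xlog_flush_location()"},
        {"write",   "pg_current_xlog_location()"},
        {"receive", "pg_last_xlog_receive_location()"},
        {"replay",  "pg_last_xlog_replay_location()"},
    };

    for (size_t i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
    {
        if (strcmp(mode, modes[i].mode) == 0)
            return modes[i].func;
    }
    return NULL;
}
```

The test then picks the mode explicitly instead of the routine guessing from pg_is_in_recovery().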

-       die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+       die "error running SQL: '$$stderr'\nwhile running
'@psql_params' with sql '$sql'"
if $ret == 3;
That's separate from this patch, and definitely useful.

Yeah. Slipped through. I don't think it really merits a separate patch
though tbh.

+   if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+       die "valid modes are restart, confirmed_flush";
+   }
+   if (!defined($target_lsn)) {
+       $target_lsn = $self->lsn;
+   }
That's not really the style followed by the perl scripts present in
the code regarding the use of the brackets. Do we really need to care
about the object type checks by the way?

Brackets: will look / fix.

Type checks (not in quoted snippet above): that's a convenience to let
you pass a PostgresNode instance or a string name. Maybe there's a
more idiomatic Perl-y way to write it. My Perl is pretty dire.

Regarding wait_for_catchup, there are two ways to do things. Either
query the standby like in the way 004_timeline_switch.pl does it or
the way this routine does. The way of this routine looks more
straight-forward IMO, and other tests should be refactored to use it.
In short I would make the target LSN a mandatory argument, and have
the caller send a standby's application_name instead of a PostgresNode
object, the current way to enforce the value of $standby_name being
really confusing.

Hm, ok. I'll take a look. Making LSN mandatory so you have to pass
$self->lsn is ok with me.

+ my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
'replay' => 1 );
What's actually the point of 'sent'?

Pretty useless, but we expose it in Pg, so we might as well in the tests.

+   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
@fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
'$slot_name'");
+   $result = undef if $result eq '';
+   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
Couldn't this portion be made more generic? Other queries could
benefit from that by having a routine that accepts as argument an
array of column names for example.

Yeah, probably. I'm not sure where it should live though - TestLib.pm ?

Not sure if there's an idiomatic way to pass a string (in this case a
query) in Perl with a placeholder for interpolation of values (in this
case columns). In Python you'd pass it with pre-defined
%(placeholders)s for %.

Now looking at 0002....
One whitespace:
$ git diff HEAD~1 --check
src/test/perl/pg_lsn.pm:139: trailing whitespace.
+=cut

Will fix.

pg_lsn sounds like a fine name, now we are more used to camel case for
module names. And routines are written as lower case separated by an
underscore.

Unsure what the intent of this is.

+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);
Most of the tests added don't have a description. This makes things
harder to debug when tracking an issue.

It may be good to begin using this module within the other tests in
this patch as well. Now do we actually need it? Most of the existing
tests I recall rely on the backend's operators for the pg_lsn data
type, so this is actually duplicating an existing facility. And all the
values are just passed as-is.

I added it mainly for ordered tests of whether some expected lsn had
passed/increased. But maybe it makes sense to just call into the
server and let it evaluate such tests.
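
For what "ordered tests" on LSNs involve when done on the client side, here is a small sketch in C. It assumes the usual textual form of an LSN, two hexadecimal halves separated by a slash; parse_lsn and lsn_cmp are illustrative names only and the input validation is deliberately loose. Comparing the strings directly would be wrong: "0/9" sorts after "0/10" as text but is the smaller position.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse "hi/lo" (both hexadecimal) into a 64-bit position.
 * Returns false if the text does not match or has trailing characters.
 * Sketch only: over-long hex fields are not rejected.
 */
bool
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* Three-way comparison: negative, zero or positive like strcmp(). */
int
lsn_cmp(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}
```

Letting the server evaluate the comparison with its own pg_lsn operators avoids maintaining this parsing in the test suite at all, which is the trade-off being weighed here.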

+++ b/src/test/perl/t/001_load.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+use Test::More tests => 5;
I can guess the meaning of this test, having a comment on top of it to
explain the purpose of the test is good practice though.

Will.

Looking at 0004...
+Disallows pg_recvlogial from internally retrying on error by passing --no-loop.
s/pg_recvlogial/pg_recvlogical

Thanks.

+sub pg_recvlogical_upto
+{
This looks like a good idea for your tests.

Yeah, and likely others too as we start doing more with logical
replication in future.

+my $endpos = $node_master->safe_psql('postgres', "SELECT location
FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY
location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
On the same wave as the pg_recvlogical wrapper, you may want to
consider some kind of wrapper at SQL level calling the slot functions.

I'd really rather beg off that until needed later. The SQL functions
are simple to invoke from PostgresNode::psql in the mean time; not so
much so with pg_recvlogical.

And finally 0006.
+$node_standby_1->append_conf('recovery.conf', "primary_slot_name =
$slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf',
"wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
No need to call this routine multiple times.

Increasing the test coverage is definitely worth it.

Thanks.

I'll follow up with amendments. I've also implemented Petr's
suggestion to allow explicit omission of a snapshot on slot creation.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#19)
2 attachment(s)
Re: Logical decoding on standby

On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote:

changes-in-0001-v2.diff shows the changes to PostgresNode.pm per
Michael's comments, and applies on top of 0001.

I also attach a patch to add a new CREATE_REPLICATION_SLOT option per
Petr's suggestion, so you can request a slot be created
WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of
silently suppressing snapshot export when a slot was created on a
replica. It'll conflict (easily resolved) if applied on top of the
current series.
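
As a rough sketch of what a client would send, the command text can be
built as below. This assumes the WITHOUT_SNAPSHOT syntax exactly as
proposed in the attached patch; create_logical_slot_command is a
hypothetical helper name, and the version check mirrors the one the patch
adds to streamutil.c.

```perl
use strict;
use warnings;

# Hypothetical helper: build the walsender command text for a logical slot.
# WITHOUT_SNAPSHOT is the option proposed in the attached patch; it is only
# appended for servers new enough to know it (version number >= 100000).
sub create_logical_slot_command
{
	my ($slot_name, $plugin, $server_version_num, $export_snapshot) = @_;
	my $cmd = qq[CREATE_REPLICATION_SLOT "$slot_name" LOGICAL "$plugin"];
	$cmd .= ' WITHOUT_SNAPSHOT'
	  if !$export_snapshot && $server_version_num >= 100000;
	return $cmd;
}

print create_logical_slot_command('test', 'test_decoding', 100000, 0), "\n";
print create_logical_slot_command('test', 'test_decoding', 90600, 0), "\n";
```

The first call includes the option and the second omits it, so an older
server never sees a keyword it cannot parse.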

I have more to do before re-posting the full series, so this is waiting
on author at this point. The PostgresNode changes likely break later
tests; I'm just posting them so there's some progress here and so I
don't forget over the next few days' distraction.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

changes-in-0001-v2.diff (text/plain; charset=US-ASCII)
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 28e9f0b..64a4633 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,7 +93,6 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
-use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);
 
 our @EXPORT = qw(
@@ -1325,38 +1324,62 @@ sub run_log
 	TestLib::run_log(@_);
 }
 
-=pod $node->lsn
+=pod $node->lsn(mode)
 
-Return pg_current_xlog_insert_location() or, on a replica,
-pg_last_xlog_replay_location().
+Look up xlog positions on the server:
+
+* insert position (master only, error on replica)
+* write position (master only, error on replica)
+* flush position
+* receive position (always undef on master)
+* replay position
+
+mode must be specified.
 
 =cut
 
 sub lsn
 {
-	my $self = shift;
-	return $self->safe_psql('postgres', 'select case when pg_is_in_recovery() then pg_last_xlog_replay_location() else pg_current_xlog_insert_location() end as lsn;');
+	my ($self, $mode) = @_;
+	my %modes = ('insert' => 'pg_current_xlog_insert_location()',
+				 'flush' => 'pg_current_xlog_flush_location()',
+				 'write' => 'pg_current_xlog_location()',
+				 'receive' => 'pg_last_xlog_receive_location()',
+				 'replay' => 'pg_last_xlog_replay_location()');
+
+	$mode = '<undef>' if !defined($mode);
+	die "unknown mode for 'lsn': '$mode', valid modes are " . join(', ', keys %modes)
+		if !defined($modes{$mode});
+
+	my $result = $self->safe_psql('postgres', "SELECT $modes{$mode}");
+	chomp($result);
+	if ($result eq '')
+	{
+		return undef;
+	}
+	else
+	{
+		return $result;
+	}
 }
 
 =pod $node->wait_for_catchup(standby_name, mode, target_lsn)
 
 Wait for the node with application_name standby_name (usually from node->name)
-until its replication equals or passes the upstream's xlog insert point at the
-time this function is called. By default the replay_location is waited for,
-but 'mode' may be specified to wait for any of sent|write|flush|replay.
+until its replication position in pg_stat_replication equals or passes the
+upstream's xlog insert point at the time this function is called. By default
+the replay_location is waited for, but 'mode' may be specified to wait for any
+of sent|write|flush|replay.
 
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
 Requires that the 'postgres' db exists and is accessible.
 
-If pos is passed, use that xlog position instead of the server's current
-xlog insert position.
+target_lsn may be any arbitrary lsn, but is typically $master_node->lsn('insert').
 
 This is not a test. It die()s on failure.
 
-Returns the LSN caught up to.
-
 =cut
 
 sub wait_for_catchup
@@ -1364,24 +1387,25 @@ sub wait_for_catchup
 	my ($self, $standby_name, $mode, $target_lsn) = @_;
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1 );
-	die "valid modes are " . join(', ', keys(%valid_modes)) unless exists($valid_modes{$mode});
-	if ( blessed( $standby_name ) && $standby_name->isa("PostgresNode") ) {
+	die "unknown mode $mode for 'wait_for_catchup', valid modes are " . join(', ', keys(%valid_modes)) unless exists($valid_modes{$mode});
+	# Allow passing of a PostgresNode instance as shorthand
+	if ( blessed( $standby_name ) && $standby_name->isa("PostgresNode") )
+	{
 		$standby_name = $standby_name->name;
 	}
-	if (!defined($target_lsn)) {
-		$target_lsn = $self->lsn;
-	}
-	$self->poll_query_until('postgres', qq[SELECT '$target_lsn' <= ${mode}_location FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';])
-		or die "timed out waiting for catchup";
-	return $target_lsn;
+	die 'target_lsn must be specified' unless defined($target_lsn);
+	print "Waiting for replication conn " . $standby_name . "'s " . $mode . "_location to pass " . $target_lsn . " on " . $self->name . "\n";
+	my $query = qq[SELECT '$target_lsn' <= ${mode}_location FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';];
+	$self->poll_query_until('postgres', $query)
+		or die "timed out waiting for catchup, current position is " . ($self->safe_psql('postgres', $query) || '(unknown)');
+	print "done";
 }
 
 =pod $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
 
-Wait for the named replication slot to equal or pass the xlog position of the
-server, or the supplied target_lsn if given. The position used is the
-restart_lsn unless mode is given, in which case it may be 'restart' or
-'confirmed_flush'.
+Wait for the named replication slot to equal or pass the supplied target_lsn.
+The position used is the restart_lsn unless mode is given, in which case it may
+be 'restart' or 'confirmed_flush'.
 
 Requires that the 'postgres' db exists and is accessible.
 
@@ -1389,9 +1413,9 @@ This is not a test. It die()s on failure.
 
 If the slot is not active, will time out after poll_query_until's timeout.
 
-Note that for logical slots, restart_lsn is held down by the oldest in progress tx.
+target_lsn may be any arbitrary lsn, but is typically $master_node->lsn('insert').
 
-Returns the LSN caught up to.
+Note that for logical slots, restart_lsn is held down by the oldest in-progress tx.
 
 =cut
 
@@ -1399,15 +1423,55 @@ sub wait_for_slot_catchup
 {
 	my ($self, $slot_name, $mode, $target_lsn) = @_;
 	$mode = defined($mode) ? $mode : 'restart';
-	if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+	if (!($mode eq 'restart' || $mode eq 'confirmed_flush'))
+	{
 		die "valid modes are restart, confirmed_flush";
 	}
-	if (!defined($target_lsn)) {
-		$target_lsn = $self->lsn;
-	}
-	$self->poll_query_until('postgres', qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name';])
-		or die "timed out waiting for catchup";
-	return $target_lsn;
+	die 'target lsn must be specified' unless defined($target_lsn);
+	print "Waiting for replication slot " . $slot_name . "'s " . $mode . "_lsn to pass " . $target_lsn . " on " . $self->name . "\n";
+	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name';];
+	$self->poll_query_until('postgres', $query)
+		or die "timed out waiting for catchup, current position is " . ($self->safe_psql('postgres', $query) || '(unknown)');
+	print "done\n";
+}
+
+=pod $node->query_hash($dbname, $query, @columns)
+
+Execute $query on $dbname, replacing any appearance of the string __COLUMNS__
+within the query with a comma-separated list of @columns.
+
+If __COLUMNS__ does not appear in the query, its result columns must EXACTLY
+match the order and number (but not necessarily alias) of supplied @columns.
+
+The query must return zero or one rows.
+
+Return a hash-ref representation of the results of the query, with any empty
+or null results as defined keys with an empty-string value. There is no way
+to differentiate between null and empty-string result fields.
+
+If the query returns zero rows, return a hash with all columns empty. There
+is no way to differentiate between zero rows returned and a row with only
+null columns.
+
+=cut
+
+sub query_hash
+{
+	my ($self, $dbname, $query, @columns) = @_;
+	die 'calls in array context for multi-row results not supported yet' if (wantarray);
+	# Replace __COLUMNS__ if found
+	substr($query, index($query, '__COLUMNS__'), length('__COLUMNS__')) = join(', ', @columns)
+		if index($query, '__COLUMNS__') >= 0;
+	my $result = $self->safe_psql($dbname, $query);
+	$result = undef if $result eq '';
+	# hash slice, see http://stackoverflow.com/a/16755894/398670 .
+	#
+	# Fills the hash with empty strings produced by x-operator element
+	# duplication if result is an empty row
+	#
+	my %val;
+	@val{@columns} = $result ne '' ? split(qr/\|/, $result) : ('',) x scalar(@columns);
+	return \%val;
 }
 
 =pod $node->slot(slot_name)
@@ -1426,19 +1490,8 @@ either.
 sub slot
 {
 	my ($self, $slot_name) = @_;
-	my @fields = ('plugin', 'slot_type', 'datoid', 'database', 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
-	my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ', @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'");
-	$result = undef if $result eq '';
-	# hash slice, see http://stackoverflow.com/a/16755894/398670 .
-	#
-	# Fills the hash with empty strings produced by x-operator element
-	# duplication if result is an empty row
-	#
-	my %val;
-	@val{@fields} = $result ne '' ? split(qr/\|/, $result) : ('',) x scalar(@fields);
-	$val{'restart_lsn_arr'} = parse_lsn($val{'restart_lsn'});
-	$val{'confirmed_flush_lsn_arr'} = parse_lsn($val{'confirmed_flush_lsn'});
-	return \%val;
+	my @columns = ('plugin', 'slot_type', 'datoid', 'database', 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+	return $self->query_hash('postgres', "SELECT __COLUMNS__ FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'", @columns);
 }
 
 =pod
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 5ce69bb..ba1da8c 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -40,8 +40,8 @@ $node_master->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
 
 # Wait for standbys to catch up
-$node_master->wait_for_catchup($node_standby_1);
-$node_standby_1->wait_for_catchup($node_standby_2);
+$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
+$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
 
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 5f3b2fe..7c6587a 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -32,14 +32,9 @@ $node_standby_2->start;
 # Create some content on master
 $node_master->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $until_lsn =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
 
 # Wait until standby has replayed enough data on standby 1
-my $caughtup_query =
-  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
-$node_standby_1->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby to catch up";
+$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('write'));
 
 # Stop and remove master, and promote standby 1, switching it to a new timeline
 $node_master->teardown_node;
@@ -50,7 +45,7 @@ rmtree($node_standby_2->data_dir . '/recovery.conf');
 my $connstr_1 = $node_standby_1->connstr;
 $node_standby_2->append_conf(
 	'recovery.conf', qq(
-primary_conninfo='$connstr_1'
+primary_conninfo='$connstr_1 application_name=@{[$node_standby_2->name]}'
 standby_mode=on
 recovery_target_timeline='latest'
 ));
@@ -60,12 +55,7 @@ $node_standby_2->restart;
 # to ensure that the timeline switch has been done.
 $node_standby_1->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-$until_lsn = $node_standby_1->safe_psql('postgres',
-	"SELECT pg_current_xlog_location();");
-$caughtup_query =
-  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
-$node_standby_2->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby to catch up";
+$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('write'));
 
 my $result =
   $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
make-snapshot-export-on-logical-slot-creation-option.patch (text/x-patch; charset=US-ASCII)
From 9dce1252641cde15a33198e7c117bc6138a94103 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 21 Dec 2016 11:21:46 +0800
Subject: [PATCH] Make snapshot export on logical slot creation optional

Allow logical decoding slot creation via the walsender protocol's
CREATE_REPLICATION_SLOT command to optionally suppress exporting of
a snapshot when the WITHOUT_SNAPSHOT option is passed.

This means that when we allow creation of replication slots on standbys, which
cannot export snapshots, we don't have to silently omit the snapshot creation.
It also allows clients like pg_recvlogical, which neither need nor can use the
exported snapshot, to suppress its creation. Since snapshot exporting can fail
this improves reliability.
---
 doc/src/sgml/logicaldecoding.sgml      | 13 ++++++++++---
 doc/src/sgml/protocol.sgml             | 17 +++++++++++++++--
 src/backend/replication/repl_gram.y    | 10 +++++++++-
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    |  9 ++++++++-
 src/bin/pg_basebackup/streamutil.c     |  5 +++++
 src/include/nodes/replnodes.h          |  1 +
 7 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 484915d..c0b6987 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -268,11 +268,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-snapshot-export" xreflabel="Exported Snapshots (Logical Decoding)">
     <title>Exported Snapshots</title>
     <para>
-     When a new replication slot is created using the streaming replication interface,
-     a snapshot is exported
+     When <link linkend="protocol-replication-create-slot">a new replication
+     slot is created using the streaming replication interface</>, a snapshot
+     is exported
      (see <xref linkend="functions-snapshot-synchronization">), which will show
      exactly the state of the database after which all changes will be
      included in the change stream. This can be used to create a new replica by
@@ -282,6 +283,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      database's state at that point in time, which afterwards can be updated
      using the slot's contents without losing any changes.
     </para>
+    <para>
+     Creation of a snapshot is not always possible - in particular, it will
+     fail when connected to a hot standby. Applications that do not require
+     snapshot export may suppress it with the <literal>WITHOUT_SNAPSHOT</>
+     option.
+    </para>
    </sect2>
   </sect1>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 50cf527..c3e5c58 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1433,8 +1433,8 @@ The commands accepted in walsender mode are:
     </listitem>
   </varlistentry>
 
-  <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+  <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> [<literal>WITHOUT_SNAPSHOT</>] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1474,6 +1474,19 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>WITHOUT_SNAPSHOT</></term>
+       <listitem>
+        <para>
+         By default, logical replication slot creation exports a snapshot for
+         use in initialization; see <xref linkend="logicaldecoding-snapshot-export">.
+         Because not all clients need an exported snapshot its creation can
+         be suppressed with <literal>WITHOUT_SNAPSHOT</>.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
      </variablelist>
     </listitem>
   </varlistentry>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index fd0fa6d..85091bd 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ Node *replication_parse_result;
 %token K_LOGICAL
 %token K_SLOT
 %token K_RESERVE_WAL
+%token K_WITHOUT_SNAPSHOT
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,6 +91,7 @@ Node *replication_parse_result;
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
 %type <boolval>	opt_reserve_wal
+%type <boolval> opt_without_snapshot
 
 %%
 
@@ -194,13 +196,14 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT opt_without_snapshot
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->plugin = $4;
+					cmd->without_snapshot = $5;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -276,6 +279,11 @@ opt_reserve_wal:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_without_snapshot:
+			K_WITHOUT_SNAPSHOT				{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index f83ec53..ae2784f 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -96,6 +96,7 @@ DROP_REPLICATION_SLOT		{ return K_DROP_REPLICATION_SLOT; }
 TIMELINE_HISTORY	{ return K_TIMELINE_HISTORY; }
 PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
+WITHOUT_SNAPSHOT	{ return K_WITHOUT_SNAPSHOT; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8b145e0..f7448a6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -843,7 +843,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * Export a plain (not of the snapbuild.c type) snapshot to the user
 		 * that can be imported into another session.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (cmd->without_snapshot)
+			snapshot_name = "";
+		else if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("cannot export a snapshot from a standby")));
+		else
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 595eaff..7b1b2ee 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -346,8 +346,13 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
 						  slot_name);
 	else
+	{
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"",
 						  slot_name, plugin);
+		if (PQserverVersion(conn) >= 100000)
+			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
+			appendPQExpBuffer(query, " WITHOUT_SNAPSHOT");
+	}
 
 	res = PQexec(conn, query->data);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index d2f1edb..9864594 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		reserve_wal;
+	bool		without_snapshot;
 } CreateReplicationSlotCmd;
 
 
-- 
2.5.5

#21Andrew Dunstan
andrew@dunslane.net
In reply to: Craig Ringer (#19)
Re: Logical decoding on standby

On 12/22/2016 01:21 AM, Craig Ringer wrote:

+   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
@fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
'$slot_name'");
+   $result = undef if $result eq '';
+   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
Couldn't this portion be made more generic? Other queries could
benefit from that by having a routine that accepts as argument an
array of column names for example.

Yeah, probably. I'm not sure where it should live though - TestLib.pm ?

Not sure if there's an idiomatic way to pass a string (in this case
query) in Perl with a placeholder for interpolation of values (in this
case columns). In Python you'd pass it with pre-defined
%(placeholders)s for %.

For direct interpolation of an expression, there is this slightly
baroque gadget:

my $str = "here it is @{[ arbitrary expression here ]}";

For indirect interpolation using placeholders, there is

my $str = sprintf("format string",...);

which works much like C except that the string is returned as a result
instead of being the first argument.
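
A small self-contained sketch of both approaches, applied to the column
list case from the review. The column names and the sample row are made
up for illustration, and nothing here talks to a server; the last part
shows the hash slice that turns one pipe-separated result row into a hash
keyed by column name.

```perl
use strict;
use warnings;

my @columns = ('plugin', 'slot_type', 'restart_lsn');

# Direct interpolation of an arbitrary expression inside a quoted string.
my $direct =
  "SELECT @{[ join(', ', @columns) ]} FROM pg_catalog.pg_replication_slots";

# Indirect interpolation: the caller supplies a format with a %s placeholder.
my $template = 'SELECT %s FROM pg_catalog.pg_replication_slots';
my $indirect = sprintf($template, join(', ', @columns));

print "both forms build the same query\n" if $direct eq $indirect;

# Hash slice over one row of unaligned, pipe-separated output
# (the row below is invented sample data).
my $row = 'test_decoding|logical|0/3000060';
my %val;
@val{@columns} = split(qr/\|/, $row);
print "$val{slot_type} slot at $val{restart_lsn}\n";
```

With sprintf the caller's query stays a plain single-quoted string, so
the routine that owns the column list can fill it in later.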

And as we always say, TIMTOWTDI.

cheers

andrew (japh)


#22Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#20)
9 attachment(s)
Re: Logical decoding on standby

On 23 December 2016 at 18:11, Craig Ringer <craig@2ndquadrant.com> wrote:

On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote:

changes-in-0001-v2.diff shows the changes to PostgresNode.pm per
Michael's comments, and applies on top of 0001.

I also attach a patch to add a new CREATE_REPLICATION_SLOT option per
Petr's suggestion, so you can request a slot be created
WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of
silently suppressing snapshot export when a slot was created on a
replica. It'll conflict (easily resolved) if applied on top of the
current series.

OK, patch series updated.

0001 incorporates the changes requested by Michael Paquier. Simon
expressed his intention to commit this after updates, in the separate
thread for

The pg_lsn patch is gone; I worked around it using the server to work with LSNs.

0002 (endpos) is unchanged.

0003 is new, some minimal tests for pg_recvlogical. It can be squashed
with 0002 (pg_recvlogical --endpos) if desired.

0004 (pg_recvlogical wrapper) is unchanged.

0005 (new streaming rep tests) is updated for the changes in 0001,
otherwise unchanged. Simon said he wanted to commit this soon.

0006 (timeline following) is unchanged except for updates to be
compatible with 0001.

0007 is the optional snapshot export requested by Petr.

0008 is unchanged.

0009 is unchanged except for updates vs 0001 and use of the
WITHOUT_SNAPSHOT option added in 0007.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-PostgresNode-methods-to-wait-for-node-catchup.patch (text/x-patch; charset=US-ASCII)
From 79241c8052fbd1ecd079e98fd8564e4b2fcf797b Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 14 Nov 2016 12:27:17 +0800
Subject: [PATCH 01/10] PostgresNode methods to wait for node catchup

---
 src/test/perl/PostgresNode.pm              | 172 ++++++++++++++++++++++++++++-
 src/test/recovery/t/001_stream_rep.pl      |  12 +-
 src/test/recovery/t/004_timeline_switch.pl |  16 +--
 3 files changed, 175 insertions(+), 25 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index c1b16ca..2f009d4 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1121,7 +1121,6 @@ sub psql
 		my $exc_save = $@;
 		if ($exc_save)
 		{
-
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
@@ -1173,7 +1172,7 @@ sub psql
 		  if $ret == 1;
 		die "connection error: '$$stderr'\nwhile running '@psql_params'"
 		  if $ret == 2;
-		die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+		die "error running SQL: '$$stderr'\nwhile running '@psql_params' with sql '$sql'"
 		  if $ret == 3;
 		die "psql returns $ret: '$$stderr'\nwhile running '@psql_params'";
 	}
@@ -1325,6 +1324,175 @@ sub run_log
 	TestLib::run_log(@_);
 }
 
+=pod $node->lsn(mode)
+
+Look up xlog positions on the server:
+
+* insert position (master only, error on replica)
+* write position (master only, error on replica)
+* flush position
+* receive position (always undef on master)
+* replay position
+
+mode must be specified.
+
+=cut
+
+sub lsn
+{
+	my ($self, $mode) = @_;
+	my %modes = ('insert' => 'pg_current_xlog_insert_location()',
+				 'flush' => 'pg_current_xlog_flush_location()',
+				 'write' => 'pg_current_xlog_location()',
+				 'receive' => 'pg_last_xlog_receive_location()',
+				 'replay' => 'pg_last_xlog_replay_location()');
+
+	$mode = '<undef>' if !defined($mode);
+	die "unknown mode for 'lsn': '$mode', valid modes are " . join(', ', keys %modes)
+		if !defined($modes{$mode});
+
+	my $result = $self->safe_psql('postgres', "SELECT $modes{$mode}");
+	chomp($result);
+	if ($result eq '')
+	{
+		return undef;
+	}
+	else
+	{
+		return $result;
+	}
+}
+
+=pod $node->wait_for_catchup(standby_name, mode, target_lsn)
+
+Wait for the node with application_name standby_name (usually from node->name)
+until its replication position in pg_stat_replication equals or passes the
+upstream's xlog insert point at the time this function is called. By default
+the replay_location is waited for, but 'mode' may be specified to wait for any
+of sent|write|flush|replay.
+
+If there is no active replication connection from this peer, waits until
+poll_query_until timeout.
+
+Requires that the 'postgres' db exists and is accessible.
+
+target_lsn may be any arbitrary lsn, but is typically $master_node->lsn('insert').
+
+This is not a test. It die()s on failure.
+
+=cut
+
+sub wait_for_catchup
+{
+	my ($self, $standby_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'replay';
+	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1 );
+	die "unknown mode $mode for 'wait_for_catchup', valid modes are " . join(', ', keys(%valid_modes)) unless exists($valid_modes{$mode});
+	# Allow passing of a PostgresNode instance as shorthand
+	if ( blessed( $standby_name ) && $standby_name->isa("PostgresNode") )
+	{
+		$standby_name = $standby_name->name;
+	}
+	die 'target_lsn must be specified' unless defined($target_lsn);
+	print "Waiting for replication conn " . $standby_name . "'s " . $mode . "_location to pass " . $target_lsn . " on " . $self->name . "\n";
+	my $query = qq[SELECT '$target_lsn' <= ${mode}_location FROM pg_catalog.pg_stat_replication WHERE application_name = '$standby_name';];
+	$self->poll_query_until('postgres', $query)
+		or die "timed out waiting for catchup, current position is " . ($self->safe_psql('postgres', $query) || '(unknown)');
+	print "done\n";
+}
+
+=pod $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
+
+Wait for the named replication slot to equal or pass the supplied target_lsn.
+The position used is the restart_lsn unless mode is given, in which case it may
+be 'restart' or 'confirmed_flush'.
+
+Requires that the 'postgres' db exists and is accessible.
+
+This is not a test. It die()s on failure.
+
+If the slot is not active, will time out after poll_query_until's timeout.
+
+target_lsn may be any arbitrary lsn, but is typically $master_node->lsn('insert').
+
+Note that for logical slots, restart_lsn is held down by the oldest in-progress tx.
+
+=cut
+
+sub wait_for_slot_catchup
+{
+	my ($self, $slot_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'restart';
+	if (!($mode eq 'restart' || $mode eq 'confirmed_flush'))
+	{
+		die "valid modes are restart, confirmed_flush";
+	}
+	die 'target lsn must be specified' unless defined($target_lsn);
+	print "Waiting for replication slot " . $slot_name . "'s " . $mode . "_lsn to pass " . $target_lsn . " on " . $self->name . "\n";
+	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name';];
+	$self->poll_query_until('postgres', $query)
+		or die "timed out waiting for catchup, current position is " . ($self->safe_psql('postgres', $query) || '(unknown)');
+	print "done\n";
+}
+
+=pod $node->query_hash($dbname, $query, @columns)
+
+Execute $query on $dbname, replacing any appearance of the string __COLUMNS__
+within the query with a comma-separated list of @columns.
+
+If __COLUMNS__ does not appear in the query, its result columns must EXACTLY
+match the order and number (but not necessarily alias) of supplied @columns.
+
+The query must return zero or one rows.
+
+Return a hash-ref representation of the results of the query, with any empty
+or null results as defined keys with an empty-string value. There is no way
+to differentiate between null and empty-string result fields.
+
+If the query returns zero rows, return a hash with all columns empty. There
+is no way to differentiate between zero rows returned and a row with only
+null columns.
+
+=cut
+
+sub query_hash
+{
+	my ($self, $dbname, $query, @columns) = @_;
+	die 'calls in array context for multi-row results not supported yet' if (wantarray);
+	# Replace __COLUMNS__ if found
+	substr($query, index($query, '__COLUMNS__'), length('__COLUMNS__')) = join(', ', @columns)
+		if index($query, '__COLUMNS__') >= 0;
+	my $result = $self->safe_psql($dbname, $query);
+	# hash slice, see http://stackoverflow.com/a/16755894/398670 .
+	#
+	# Fills the hash with empty strings produced by x-operator element
+	# duplication if result is an empty row
+	#
+	my %val;
+	@val{@columns} = $result ne '' ? split(qr/\|/, $result) : ('',) x scalar(@columns);
+	return \%val;
+}
+
+=pod $node->slot(slot_name)
+
+Return hash-ref of replication slot data for the named slot, or a hash-ref with
+all values '' if not found. Does not differentiate between null and empty string
+for fields, no field is ever undef.
+
+The restart_lsn and confirmed_flush_lsn fields are returned verbatim, and also
+as a 2-list of [highword, lowword] integer. Since we rely on Perl 5.8.8 we can't
+"use bigint", it's from 5.20, and we can't assume we have Math::Bigint from CPAN
+either.
+
+=cut
+
+sub slot
+{
+	my ($self, $slot_name) = @_;
+	my @columns = ('plugin', 'slot_type', 'datoid', 'database', 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+	return $self->query_hash('postgres', "SELECT __COLUMNS__ FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'", @columns);
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 981c00b..ba1da8c 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -40,16 +40,8 @@ $node_master->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
 
 # Wait for standbys to catch up
-my $applname_1 = $node_standby_1->name;
-my $applname_2 = $node_standby_2->name;
-my $caughtup_query =
-"SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
-$node_master->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby 1 to catch up";
-$caughtup_query =
-"SELECT pg_last_xlog_replay_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
-$node_standby_1->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby 2 to catch up";
+$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
+$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
 
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 5f3b2fe..7c6587a 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -32,14 +32,9 @@ $node_standby_2->start;
 # Create some content on master
 $node_master->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $until_lsn =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
 
 # Wait until standby has replayed enough data on standby 1
-my $caughtup_query =
-  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
-$node_standby_1->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby to catch up";
+$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('write'));
 
 # Stop and remove master, and promote standby 1, switching it to a new timeline
 $node_master->teardown_node;
@@ -50,7 +45,7 @@ rmtree($node_standby_2->data_dir . '/recovery.conf');
 my $connstr_1 = $node_standby_1->connstr;
 $node_standby_2->append_conf(
 	'recovery.conf', qq(
-primary_conninfo='$connstr_1'
+primary_conninfo='$connstr_1 application_name=@{[$node_standby_2->name]}'
 standby_mode=on
 recovery_target_timeline='latest'
 ));
@@ -60,12 +55,7 @@ $node_standby_2->restart;
 # to ensure that the timeline switch has been done.
 $node_standby_1->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-$until_lsn = $node_standby_1->safe_psql('postgres',
-	"SELECT pg_current_xlog_location();");
-$caughtup_query =
-  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
-$node_standby_2->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby to catch up";
+$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('write'));
 
 my $result =
   $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-- 
2.5.5
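
For readers following along: the slot() documentation in the patch above mentions exposing an LSN as a [highword, lowword] pair, and wait_for_catchup depends on LSN ordering. Below is a minimal sketch in C of that representation, not code from the patch. The names lsn_pair, lsn_parse and lsn_cmp are hypothetical, and the sketch assumes the usual textual LSN form of two hexadecimal words separated by a slash.

```c
#include <stdio.h>

/* Hypothetical helper type: an LSN held as two 32-bit words, e.g. "16/B374D848". */
typedef struct
{
	unsigned int hi;			/* high word, the part before the slash */
	unsigned int lo;			/* low word, the part after the slash */
} lsn_pair;

/*
 * Parse a textual LSN of the form "X/X" (hexadecimal words).
 * Returns 1 on success and 0 on malformed input or trailing characters.
 */
static int
lsn_parse(const char *text, lsn_pair *out)
{
	char		trailing;

	/* Exactly two conversions must succeed; a third means trailing junk. */
	return sscanf(text, "%X/%X%c", &out->hi, &out->lo, &trailing) == 2;
}

/*
 * Compare two LSNs word by word, without any 64-bit arithmetic.
 * Returns -1, 0 or 1 when a is before, equal to, or after b.
 */
static int
lsn_cmp(lsn_pair a, lsn_pair b)
{
	if (a.hi != b.hi)
		return a.hi < b.hi ? -1 : 1;
	if (a.lo != b.lo)
		return a.lo < b.lo ? -1 : 1;
	return 0;
}
```

Comparing the high word first and the low word second gives the same ordering as comparing the combined 64-bit values, which is why the pair form is enough for "has the slot passed this LSN" checks.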

Attachment: 0002-Add-an-optional-endpos-LSN-argument-to-pg_recvlogica.patch (text/x-patch; charset=UTF-8)
From e746b2de8ddcab988cf4196ace059902df90ac9e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 12:37:40 +0800
Subject: [PATCH 02/10] Add an optional --endpos LSN argument to pg_recvlogical
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

pg_recvlogical usually just runs until cancelled or until the upstream
server disconnects. For some purposes, especially testing, it's useful
to be able to stop receiving at a specified LSN without having
to parse the output and deal with buffering issues, etc.

Add an --endpos parameter that takes the LSN at which no further
messages should be written and receiving should stop.

Craig Ringer, Álvaro Herrera
---
 doc/src/sgml/ref/pg_recvlogical.sgml   |  34 ++++++++
 src/bin/pg_basebackup/pg_recvlogical.c | 145 +++++++++++++++++++++++++++++----
 2 files changed, 164 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index b35881f..d066ce8 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -38,6 +38,14 @@ PostgreSQL documentation
    constraints as <xref linkend="app-pgreceivexlog">, plus those for logical
    replication (see <xref linkend="logicaldecoding">).
   </para>
+
+  <para>
+   <command>pg_recvlogical</> has no equivalent to the logical decoding
+   SQL interface's peek and get modes. It sends replay confirmations for
+   data lazily as it receives it and on clean exit. To examine pending data on
+   a slot without consuming it, use
+   <link linkend="functions-replication"><function>pg_logical_slot_peek_changes</></>.
+  </para>
  </refsect1>
 
  <refsect1>
@@ -155,6 +163,32 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-E <replaceable>lsn</replaceable></option></term>
+      <term><option>--endpos=<replaceable>lsn</replaceable></option></term>
+      <listitem>
+       <para>
+        In <option>--start</option> mode, automatically stop replication
+        and exit with normal exit status 0 when receiving reaches the
+        specified LSN.  If specified when not in <option>--start</option>
+        mode, an error is raised.
+       </para>
+
+       <para>
+        If there's a record with LSN exactly equal to <replaceable>lsn</>,
+        the record will be output.
+       </para>
+
+       <para>
+        The <option>--endpos</option> option is not aware of transaction
+        boundaries and may truncate output partway through a transaction.
+        Any partially output transaction will not be consumed and will be
+        replayed again when the slot is next read from. Individual messages
+        are never truncated.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>--if-not-exists</option></term>
       <listitem>
        <para>
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..4e6a8c2 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -40,6 +40,7 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
 static XLogRecPtr startpos = InvalidXLogRecPtr;
+static XLogRecPtr endpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
 static bool slot_exists_ok = false;
 static bool do_start_slot = false;
@@ -63,6 +64,9 @@ static XLogRecPtr output_fsync_lsn = InvalidXLogRecPtr;
 static void usage(void);
 static void StreamLogicalLog(void);
 static void disconnect_and_exit(int code);
+static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
+static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
+				   bool keepalive, XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -81,6 +85,7 @@ usage(void)
 			 "                         time between fsyncs to the output file (default: %d)\n"), (fsync_interval / 1000));
 	printf(_("      --if-not-exists    do not error if slot already exists when creating a slot\n"));
 	printf(_("  -I, --startpos=LSN     where in an existing slot should the streaming start\n"));
+	printf(_("  -E, --endpos=LSN       exit after receiving the specified LSN\n"));
 	printf(_("  -n, --no-loop          do not loop on connection lost\n"));
 	printf(_("  -o, --option=NAME[=VALUE]\n"
 			 "                         pass option NAME with optional value VALUE to the\n"
@@ -281,6 +286,7 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		int64		now;
 		int			hdr_len;
+		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -454,6 +460,7 @@ StreamLogicalLog(void)
 			int			pos;
 			bool		replyRequested;
 			XLogRecPtr	walEnd;
+			bool		endposReached = false;
 
 			/*
 			 * Parse the keepalive message, enclosed in the CopyData message.
@@ -476,18 +483,32 @@ StreamLogicalLog(void)
 			}
 			replyRequested = copybuf[pos];
 
-			/* If the server requested an immediate reply, send one. */
-			if (replyRequested)
+			if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
 			{
-				/* fsync data, so we send a recent flush pointer */
-				if (!OutputFsync(now))
-					goto error;
+				/*
+				 * If there's nothing to read on the socket until a keepalive
+				 * we know that the server has nothing to send us; and if
+				 * walEnd has passed endpos, we know nothing else can have
+				 * committed before endpos.  So we can bail out now.
+				 */
+				endposReached = true;
+			}
 
-				now = feGetCurrentTimestamp();
-				if (!sendFeedback(conn, now, true, false))
+			/* Send a reply, if necessary */
+			if (replyRequested || endposReached)
+			{
+				if (!flushAndSendFeedback(conn, &now))
 					goto error;
 				last_status = now;
 			}
+
+			if (endposReached)
+			{
+				prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
+				time_to_abort = true;
+				break;
+			}
+
 			continue;
 		}
 		else if (copybuf[0] != 'w')
@@ -497,7 +518,6 @@ StreamLogicalLog(void)
 			goto error;
 		}
 
-
 		/*
 		 * Read the header of the XLogData message, enclosed in the CopyData
 		 * message. We only need the WAL location field (dataStart), the rest
@@ -515,12 +535,23 @@ StreamLogicalLog(void)
 		}
 
 		/* Extract WAL location for this block */
-		{
-			XLogRecPtr	temp = fe_recvint64(&copybuf[1]);
+		cur_record_lsn = fe_recvint64(&copybuf[1]);
 
-			output_written_lsn = Max(temp, output_written_lsn);
+		if (endpos != InvalidXLogRecPtr && cur_record_lsn > endpos)
+		{
+			/*
+			 * We've read past our endpoint, so prepare to go away being
+			 * cautious about what happens to our output data.
+			 */
+			if (!flushAndSendFeedback(conn, &now))
+				goto error;
+			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			time_to_abort = true;
+			break;
 		}
 
+		output_written_lsn = Max(cur_record_lsn, output_written_lsn);
+
 		bytes_left = r - hdr_len;
 		bytes_written = 0;
 
@@ -557,10 +588,29 @@ StreamLogicalLog(void)
 					strerror(errno));
 			goto error;
 		}
+
+		if (endpos != InvalidXLogRecPtr && cur_record_lsn == endpos)
+		{
+			/* endpos was exactly the record we just processed, we're done */
+			if (!flushAndSendFeedback(conn, &now))
+				goto error;
+			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			time_to_abort = true;
+			break;
+		}
 	}
 
 	res = PQgetResult(conn);
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	if (PQresultStatus(res) == PGRES_COPY_OUT)
+	{
+		/*
+		 * We're doing a client-initiated clean exit and have sent CopyDone to
+		 * the server. We've already sent replay confirmation and fsync'd so
+		 * we can just clean up the connection now.
+		 */
+		goto error;
+	}
+	else if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		fprintf(stderr,
 				_("%s: unexpected termination of replication stream: %s"),
@@ -638,6 +688,7 @@ main(int argc, char **argv)
 		{"password", no_argument, NULL, 'W'},
 /* replication options */
 		{"startpos", required_argument, NULL, 'I'},
+		{"endpos", required_argument, NULL, 'E'},
 		{"option", required_argument, NULL, 'o'},
 		{"plugin", required_argument, NULL, 'P'},
 		{"status-interval", required_argument, NULL, 's'},
@@ -673,7 +724,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:E:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -733,6 +784,16 @@ main(int argc, char **argv)
 				}
 				startpos = ((uint64) hi) << 32 | lo;
 				break;
+			case 'E':
+				if (sscanf(optarg, "%X/%X", &hi, &lo) != 2)
+				{
+					fprintf(stderr,
+							_("%s: could not parse end position \"%s\"\n"),
+							progname, optarg);
+					exit(1);
+				}
+				endpos = ((uint64) hi) << 32 | lo;
+				break;
 			case 'o':
 				{
 					char	   *data = pg_strdup(optarg);
@@ -857,6 +918,16 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (endpos != InvalidXLogRecPtr && !do_start_slot)
+	{
+		fprintf(stderr,
+				_("%s: --endpos may only be specified with --start\n"),
+				progname);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -923,8 +994,8 @@ main(int argc, char **argv)
 		if (time_to_abort)
 		{
 			/*
-			 * We've been Ctrl-C'ed. That's not an error, so exit without an
-			 * errorcode.
+			 * We've been Ctrl-C'ed or reached an exit limit condition. That's
+			 * not an error, so exit without an errorcode.
 			 */
 			disconnect_and_exit(0);
 		}
@@ -943,3 +1014,47 @@ main(int argc, char **argv)
 		}
 	}
 }
+
+/*
+ * Fsync our output data, and send a feedback message to the server.  Returns
+ * true if successful, false otherwise.
+ *
+ * If successful, *now is updated to the current timestamp just before sending
+ * feedback.
+ */
+static bool
+flushAndSendFeedback(PGconn *conn, TimestampTz *now)
+{
+	/* flush data to disk, so that we send a recent flush pointer */
+	if (!OutputFsync(*now))
+		return false;
+	*now = feGetCurrentTimestamp();
+	if (!sendFeedback(conn, *now, true, false))
+		return false;
+
+	return true;
+}
+
+/*
+ * Try to inform the server about our upcoming demise, but don't wait around or
+ * retry on failure.
+ */
+static void
+prepareToTerminate(PGconn *conn, XLogRecPtr endpos, bool keepalive, XLogRecPtr lsn)
+{
+	(void) PQputCopyEnd(conn, NULL);
+	(void) PQflush(conn);
+
+	if (verbose)
+	{
+		if (keepalive)
+			fprintf(stderr, "%s: endpos %X/%X reached by keepalive\n",
+					progname,
+					(uint32) (endpos >> 32), (uint32) endpos);
+		else
+			fprintf(stderr, "%s: endpos %X/%X reached by record at %X/%X\n",
+					progname, (uint32) (endpos >> 32), (uint32) (endpos),
+					(uint32) (lsn >> 32), (uint32) lsn);
+
+	}
+}
-- 
2.5.5
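
To summarise the stop conditions that the --endpos patch above adds to StreamLogicalLog(), here is a small standalone sketch in C. It is an illustration, not the patch's code. The names lsn_t, INVALID_LSN, endpos_action, parse_lsn, on_record and keepalive_reaches_endpos are all hypothetical; only the comparisons themselves are taken from the diff.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t lsn_t;			/* hypothetical stand-in for XLogRecPtr */
#define INVALID_LSN ((lsn_t) 0) /* stand-in for InvalidXLogRecPtr */

/* Hypothetical result type describing what to do with a received record. */
typedef enum
{
	KEEP_GOING,					/* write the record and continue streaming */
	WRITE_THEN_STOP,			/* record is exactly at endpos: write it, then exit */
	STOP_WITHOUT_WRITING		/* record is past endpos: exit without writing it */
} endpos_action;

/* Parse "X/X" the same way the option handling does; returns 1 on success. */
static int
parse_lsn(const char *text, lsn_t *out)
{
	unsigned int hi;
	unsigned int lo;

	if (sscanf(text, "%X/%X", &hi, &lo) != 2)
		return 0;
	*out = ((lsn_t) hi) << 32 | lo;
	return 1;
}

/* Decision for an XLogData ('w') message carrying a record at record_lsn. */
static endpos_action
on_record(lsn_t endpos, lsn_t record_lsn)
{
	if (endpos == INVALID_LSN)
		return KEEP_GOING;		/* no --endpos given */
	if (record_lsn > endpos)
		return STOP_WITHOUT_WRITING;
	if (record_lsn == endpos)
		return WRITE_THEN_STOP;
	return KEEP_GOING;
}

/* Decision for a keepalive ('k') message reporting the server's walEnd. */
static int
keepalive_reaches_endpos(lsn_t endpos, lsn_t wal_end)
{
	return endpos != INVALID_LSN && wal_end >= endpos;
}
```

The asymmetry is deliberate in the patch: a record exactly at endpos is still output, a record beyond it is not, and a keepalive at or past endpos ends the session because the server has nothing earlier left to send.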

Attachment: 0003-Add-some-minimal-tests-for-pg_recvlogical.patch (text/x-patch; charset=US-ASCII)
From c29990d14c5a665315ce179743e3074a3df7156f Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 3 Jan 2017 18:21:48 +0800
Subject: [PATCH 03/10] Add some minimal tests for pg_recvlogical

---
 src/bin/pg_basebackup/Makefile                |  2 ++
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 46 +++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)
 create mode 100644 src/bin/pg_basebackup/t/030_pg_recvlogical.pl

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 52ac9e9..1e54b19 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -12,6 +12,8 @@
 PGFILEDESC = "pg_basebackup/pg_receivexlog/pg_recvlogical - streaming WAL and backup receivers"
 PGAPPICON=win32
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/bin/pg_basebackup
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
new file mode 100644
index 0000000..dca5ef2
--- /dev/null
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -0,0 +1,46 @@
+use strict;
+use warnings;
+use TestLib;
+use PostgresNode;
+use Test::More tests => 15;
+
+program_help_ok('pg_recvlogical');
+program_version_ok('pg_recvlogical');
+program_options_handling_ok('pg_recvlogical');
+
+my $node = get_new_node('main');
+
+# Initialize node, then append the logical decoding settings
+$node->init(allows_streaming => 1, has_archiving => 1);
+$node->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug1'
+log_error_verbosity = verbose
+});
+$node->dump_info;
+$node->start;
+
+$node->command_fails(['pg_recvlogical'],
+	'pg_recvlogical needs a slot name');
+$node->command_fails(['pg_recvlogical', '-S', 'test'],
+	'pg_recvlogical needs a database');
+$node->command_fails(['pg_recvlogical', '-S', 'test', '-d', 'postgres'],
+	'pg_recvlogical needs an action');
+$node->command_fails(['pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'), '--start'],
+	'no destination file');
+
+$node->command_ok(['pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'), '--create-slot'],
+	'slot created');
+
+my $slot = $node->slot('test');
+isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot');
+
+$node->psql('postgres', 'CREATE TABLE test_table(x integer)');
+$node->psql('postgres', 'INSERT INTO test_table(x) SELECT y FROM generate_series(1, 10) a(y);');
+my $nextlsn = $node->safe_psql('postgres', 'SELECT pg_current_xlog_insert_location()');
+chomp($nextlsn);
+
+$node->command_ok(['pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'), '--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'],
+	'replayed a transaction');
-- 
2.5.5

Attachment: 0004-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From 4f1e2bad038bb5c158739eb58ba1d60bfb7033b3 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 04/10] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 75 ++++++++++++++++++++++++++++-
 src/test/recovery/t/006_logical_decoding.pl | 31 +++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 2f009d4..5197e80 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1124,7 +1124,7 @@ sub psql
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
-			  if (blessed($exc_save) || $exc_save ne $timeout_exception);
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
 
 			$ret = undef;
 
@@ -1493,6 +1493,79 @@ sub slot
 	return $self->query_hash('postgres', "SELECT __COLUMNS__ FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'", @columns);
 }
 
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos if ($endpos);
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index b80a9a9..d8cc8d3 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -36,5 +40,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5

Attachment: 0005-Expand-streaming-replication-tests-to-cover-hot-stan.patch (text/x-patch; charset=US-ASCII)
From ba05ef5fdf0955de3fc06cca1f8ed9ee0ad2d3a7 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 9 Nov 2016 13:44:04 +0800
Subject: [PATCH 05/10] Expand streaming replication tests to cover hot standby
 feedback and physical replication slots

---
 src/test/recovery/t/001_stream_rep.pl | 105 +++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ba1da8c..eef512d 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 4;
+use Test::More tests => 22;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -58,3 +58,106 @@ is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
 is($node_standby_2->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 2');
+
+diag "switching to physical replication slot";
+# Switch to using a physical replication slot. We can do this without a new
+# backup since physical slots can go backwards if needed. Do so on both
+# standbys. Since we're going to be testing things that affect the slot state,
+# also shorten the standby status interval to ensure timely updates.
+my ($slotname_1, $slotname_2) = ('standby_1', 'standby_2');
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_master->restart;
+is($node_master->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_1');]), 0, 'physical slot created on master');
+$node_standby_1->append_conf('recovery.conf', "primary_slot_name = $slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
+$node_standby_1->restart;
+is($node_standby_1->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_2');]), 0, 'physical slot created on intermediate replica');
+$node_standby_2->append_conf('recovery.conf', "primary_slot_name = $slotname_2\n");
+$node_standby_2->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_2->restart;
+
+sub get_slot_xmins
+{
+	my ($node, $slotname) = @_;
+	my $slotinfo = $node->slot($slotname);
+	return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+# There's no hot standby feedback and there are no logical slots on either peer
+# so xmin and catalog_xmin should be null on both slots.
+my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin null with no hs_feedback');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin null with no hs_feedback');
+
+# Replication still works?
+$node_master->safe_psql('postgres', 'CREATE TABLE replayed(val integer);');
+
+sub replay_check
+{
+	my $newval = $node_master->safe_psql('postgres', 'INSERT INTO replayed(val) SELECT coalesce(max(val),0) + 1 AS newval FROM replayed RETURNING val');
+	$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
+	$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
+	$node_standby_1->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
+		or die "standby_1 didn't replay master value $newval";
+	$node_standby_2->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
+		or die "standby_2 didn't replay standby_1 value $newval";
+}
+
+replay_check();
+
+diag "enabling hot_standby_feedback";
+# Enable hs_feedback. The slot should gain an xmin. We set the status interval
+# so we'll see the results promptly.
+$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_1->reload;
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_2->reload;
+replay_check();
+sleep(2);
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+isnt($xmin, '', 'non-cascaded slot xmin non-null with hs feedback');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin still null with hs_feedback');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+isnt($xmin, '', 'cascaded slot xmin non-null with hs feedback');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin still null with hs_feedback');
+
+diag "doing some work to advance xmin";
+for my $i (10000..11000) {
+	$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES ($i);]);
+}
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my ($xmin2, $catalog_xmin2) = get_slot_xmins($node_master, $slotname_1);
+diag "new xmin $xmin2, old xmin $xmin";
+isnt($xmin2, $xmin, 'non-cascaded slot xmin with hs feedback has changed');
+is($catalog_xmin2, '', 'non-cascaded slot catalog_xmin still null with hs_feedback unchanged');
+
+($xmin2, $catalog_xmin2) = get_slot_xmins($node_standby_1, $slotname_2);
+diag "new xmin $xmin2, old xmin $xmin";
+isnt($xmin2, $xmin, 'cascaded slot xmin with hs feedback has changed');
+is($catalog_xmin2, '', 'cascaded slot catalog_xmin still null with hs_feedback unchanged');
+
+diag "disabling hot_standby_feedback";
+# Disable hs_feedback. Xmin should be cleared.
+$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_1->reload;
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_2->reload;
+replay_check();
+sleep(2);
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
+is($xmin, '', 'non-cascaded slot xmin null with hs feedback reset');
+is($catalog_xmin, '', 'non-cascaded slot catalog_xmin still null with hs_feedback reset');
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
+is($catalog_xmin, '', 'cascaded slot catalog_xmin still null with hs_feedback reset');
-- 
2.5.5

Attachment: 0006-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From 7ad6f7c0127c1d6cdbb7ce6ab55a28d3d07933fd Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 06/10] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.5 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.5
feature freeze when issues were discovered too late to safely fix them
in the 9.5 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 51a8e8d..ab15cf3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -660,6 +661,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -675,7 +677,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -702,6 +705,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -750,6 +754,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -770,28 +897,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're a logical decoding session on C, and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * the old one.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 318726e..a8f7b76 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -234,13 +234,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -279,6 +279,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5cdb8a0..acb3370 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -760,6 +761,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -992,10 +999,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..8f96728 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -160,6 +160,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 * 
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index d027ea1..f0ee352 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index a847952..d2ff1e9 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before base_backup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5
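As a rough illustration of the per-segment timeline lookup that the 0006 patch describes, here is a minimal standalone C sketch. It does not use PostgreSQL's headers: the HistoryEntry type and the helpers end_of_segment(), tli_of_point() and pick_timeline() are hypothetical stand-ins for the XLogRecPtr arithmetic, tliOfPointInHistory() and tliSwitchPoint(), and a 16MB segment size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t LSN;   /* stand-in for XLogRecPtr (assumption) */
typedef uint32_t TLI;   /* stand-in for TimeLineID (assumption) */

#define SEG_SIZE    ((LSN) 16 * 1024 * 1024)   /* assumed 16MB WAL segments */
#define INVALID_LSN ((LSN) 0)

/*
 * Hypothetical history entry: the timeline is valid for begin <= lsn < end.
 * end == INVALID_LSN marks the current (still open) timeline.
 */
typedef struct
{
    TLI tli;
    LSN begin;
    LSN end;
} HistoryEntry;

/* Last byte position of the segment that contains 'page'. */
static LSN
end_of_segment(LSN page)
{
    return ((page / SEG_SIZE) + 1) * SEG_SIZE - 1;
}

/* Timeline that owns 'lsn' according to the history, or 0 if none does. */
static TLI
tli_of_point(const HistoryEntry *hist, int n, LSN lsn)
{
    for (int i = 0; i < n; i++)
    {
        if (lsn >= hist[i].begin &&
            (hist[i].end == INVALID_LSN || lsn < hist[i].end))
            return hist[i].tli;
    }
    return 0;
}

/*
 * Pick the newest timeline present on the segment containing want_page,
 * and report where that timeline stops being valid (INVALID_LSN if it is
 * the current timeline).
 */
static TLI
pick_timeline(const HistoryEntry *hist, int n, LSN want_page, LSN *valid_until)
{
    TLI tli;

    assert(hist != NULL && n > 0 && valid_until != NULL);

    tli = tli_of_point(hist, n, end_of_segment(want_page));
    *valid_until = INVALID_LSN;
    for (int i = 0; i < n; i++)
    {
        if (hist[i].tli == tli)
            *valid_until = hist[i].end;
    }
    return tli;
}
```

With a history in which timeline 1 ends at 0x3000060 and timeline 2 begins there, a page in the segment starting at 0x2000000 resolves to timeline 1, while any page in the segment starting at 0x3000000 resolves to timeline 2, even if the page itself lies before the switch point. That mirrors the patch's reasoning that the new timeline's copy of the segment contains the old timeline's data up to the switch.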

Attachment: 0007-Make-snapshot-export-on-logical-slot-creation-option.patch (text/x-patch; charset=US-ASCII)
From 5cdf9953cdbf38fca3bb8e7c19d6e1425b456877 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 21 Dec 2016 11:21:46 +0800
Subject: [PATCH 07/10] Make snapshot export on logical slot creation optional

Allow logical decoding slot creation via the walsender protocol's
CREATE_REPLICATION_SLOT command to optionally suppress exporting of
a snapshot when the WITHOUT_SNAPSHOT option is passed.

This means that when we allow creation of replication slots on standbys, which
cannot export snapshots, we don't have to silently omit the snapshot creation.
It also allows clients like pg_recvlogical, which neither need nor can use the
exported snapshot, to suppress its creation. Since snapshot exporting can fail
this improves reliability.
---
 doc/src/sgml/logicaldecoding.sgml      | 13 ++++++++++---
 doc/src/sgml/protocol.sgml             | 17 +++++++++++++++--
 src/backend/replication/repl_gram.y    | 13 ++++++++++---
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    |  9 ++++++++-
 src/bin/pg_basebackup/streamutil.c     |  5 +++++
 src/include/nodes/replnodes.h          |  1 +
 7 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 484915d..c0b6987 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -268,11 +268,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-snapshot-export" xreflabel="Exported Snapshots (Logical Decoding)">
     <title>Exported Snapshots</title>
     <para>
-     When a new replication slot is created using the streaming replication interface,
-     a snapshot is exported
+     When <link linkend="protocol-replication-create-slot">a new replication
+     slot is created using the streaming replication interface</>, a snapshot
+     is exported
      (see <xref linkend="functions-snapshot-synchronization">), which will show
      exactly the state of the database after which all changes will be
      included in the change stream. This can be used to create a new replica by
@@ -282,6 +283,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      database's state at that point in time, which afterwards can be updated
      using the slot's contents without losing any changes.
     </para>
+    <para>
+     Creation of a snapshot is not always possible - in particular, it will
+     fail when connected to a hot standby. Applications that do not require
+     snapshot export may suppress it with the <literal>WITHOUT_SNAPSHOT</>
+     option.
+    </para>
    </sect2>
   </sect1>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 9ba147c..e41c650 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1433,8 +1433,8 @@ The commands accepted in walsender mode are:
     </listitem>
   </varlistentry>
 
-  <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+  <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> [<literal>WITHOUT_SNAPSHOT</>] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1485,6 +1485,19 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>WITHOUT_SNAPSHOT</></term>
+       <listitem>
+        <para>
+         By default, logical replication slot creation exports a snapshot for
+         use in initialization; see <xref linkend="logicaldecoding-snapshot-export">.
+         Because not all clients need an exported snapshot, its creation can
+         be suppressed with <literal>WITHOUT_SNAPSHOT</>.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
      </variablelist>
     </listitem>
   </varlistentry>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 8cc9edd..edf2ca4 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -78,6 +78,7 @@ Node *replication_parse_result;
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_WITHOUT_SNAPSHOT
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,7 +91,7 @@ Node *replication_parse_result;
 %type <defelt>	plugin_opt_elem
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
-%type <boolval>	opt_reserve_wal opt_temporary
+%type <boolval>	opt_reserve_wal opt_temporary opt_without_snapshot
 
 %%
 
@@ -194,8 +195,8 @@ create_replication_slot:
 					cmd->reserve_wal = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin WITHOUT_SNAPSHOT */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT opt_without_snapshot
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
@@ -203,6 +204,7 @@ create_replication_slot:
 					cmd->slotname = $2;
 					cmd->temporary = $3;
 					cmd->plugin = $5;
+					cmd->without_snapshot = $6;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -283,6 +285,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_without_snapshot:
+			K_WITHOUT_SNAPSHOT				{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 9f50ce6..9874f18 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -96,6 +96,7 @@ DROP_REPLICATION_SLOT		{ return K_DROP_REPLICATION_SLOT; }
 TIMELINE_HISTORY	{ return K_TIMELINE_HISTORY; }
 PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
+WITHOUT_SNAPSHOT	{ return K_WITHOUT_SNAPSHOT; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index acb3370..04f9adb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -849,7 +849,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * Export a plain (not of the snapbuild.c type) snapshot to the user
 		 * that can be imported into another session.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (cmd->without_snapshot)
+			snapshot_name = "";
+		else if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("cannot export a snapshot from a standby")));
+		else
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 595eaff..7b1b2ee 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -346,8 +346,13 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
 						  slot_name);
 	else
+	{
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"",
 						  slot_name, plugin);
+		if (PQserverVersion(conn) >= 100000)
+			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
+			appendPQExpBuffer(query, " WITHOUT_SNAPSHOT");
+	}
 
 	res = PQexec(conn, query->data);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 024b965..3504b59 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -57,6 +57,7 @@ typedef struct CreateReplicationSlotCmd
 	char	   *plugin;
 	bool		temporary;
 	bool		reserve_wal;
+	bool		without_snapshot;
 } CreateReplicationSlotCmd;
 
 
-- 
2.5.5
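To make the client-side effect of the 0007 patch concrete, the following standalone C sketch builds the command string the way the streamutil.c hunk does, appending WITHOUT_SNAPSHOT only for servers reporting version 100000 or later. The function build_create_slot_cmd() and its signature are hypothetical and exist only for illustration; the real code uses PQExpBuffer and PQserverVersion().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a CREATE_REPLICATION_SLOT command into buf.
 * A NULL plugin requests a physical slot. Returns 0 on success, -1 if the
 * command did not fit into the buffer.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
                      const char *plugin, int server_version)
{
    int n;

    assert(buf != NULL && slot_name != NULL);

    if (plugin == NULL)
        n = snprintf(buf, buflen,
                     "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL", slot_name);
    else
        n = snprintf(buf, buflen,
                     "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s",
                     slot_name, plugin,
                     /* older servers do not know the keyword */
                     server_version >= 100000 ? " WITHOUT_SNAPSHOT" : "");

    return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

Gating on the server version keeps the command acceptable to older servers, whose replication grammar would reject the unknown keyword.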

Attachment: 0008-ERROR-if-timeline-is-zero-in-walsender.patch (text/x-patch; charset=US-ASCII)
From 9b0a670878ece0b1bba71fb112df7c1e227c9e81 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 13:50:52 +0800
Subject: [PATCH 08/10] ERROR if timeline is zero in walsender

---
 src/backend/replication/walsender.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 04f9adb..46976ce 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -523,6 +523,11 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
+	if (ThisTimeLineID == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("run IDENTIFY_SYSTEM before trying to START_REPLICATION")));
+
 	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
-- 
2.5.5

Attachment: 0009-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 6d2e8ba8253d81588ea1331bc5b70992780c0e89 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 09/10] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, make walsender
  exit with recovery conflict on upstream drop database when it has an active
  logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin, be called already locked.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in hot_standby_feedback
* Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
  requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store
  it in checkpoints, make vacuum and bgwriter advance it.
* During decoding startup, check whether the catalog_xmin requirement can be
  satisfied and bail out if it cannot.
* Add a new recovery conflict type for conflicts with catalog_xmin. Abort
  in-progress logical decoding sessions with a recovery conflict when the
  needed catalog_xmin is too old.
* Make extra efforts to reserve master's catalog_xmin during decoding startup
  on standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove checks preventing starting logical decoding on standby
---
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  15 +
 src/backend/access/transam/xact.c                  |  55 +++
 src/backend/access/transam/xlog.c                  |  26 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |   9 +
 src/backend/postmaster/pgstat.c                    |   2 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 323 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 135 ++++--
 src/backend/storage/ipc/procarray.c                | 201 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 454 +++++++++++++++++++++
 35 files changed, 1548 insertions(+), 129 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl
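
For anyone writing a client against the extended feedback message, here is a
minimal sketch of the layout as the protocol.sgml hunk below describes it:
the existing 'h' type byte and Int64 timestamp, then xmin and its epoch,
followed by the two new Int32 fields. The helper names and the fixed-size
buffer are assumptions for illustration only; this is not the walreceiver
code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 type byte + 8-byte timestamp + four 4-byte fields. */
#define HS_FEEDBACK_LEN 25

/* Store a 32-bit value in network (big-endian) byte order. */
static void
put_be32(uint8_t *dst, uint32_t v)
{
	dst[0] = (uint8_t) (v >> 24);
	dst[1] = (uint8_t) (v >> 16);
	dst[2] = (uint8_t) (v >> 8);
	dst[3] = (uint8_t) v;
}

/* Store a 64-bit value in network byte order. */
static void
put_be64(uint8_t *dst, uint64_t v)
{
	put_be32(dst, (uint32_t) (v >> 32));
	put_be32(dst + 4, (uint32_t) v);
}

/* Hypothetical helper: serialize one feedback message, return its length. */
static size_t
build_hs_feedback(uint8_t *buf, int64_t send_time,
				  uint32_t xmin, uint32_t xmin_epoch,
				  uint32_t catalog_xmin, uint32_t catalog_xmin_epoch)
{
	assert(buf != NULL);

	buf[0] = 'h';
	put_be64(buf + 1, (uint64_t) send_time);
	put_be32(buf + 9, xmin);
	put_be32(buf + 13, xmin_epoch);
	put_be32(buf + 17, catalog_xmin);		/* new field */
	put_be32(buf + 21, catalog_xmin_epoch);	/* new field */

	return HS_FEEDBACK_LEN;
}

/* Feedback is being switched off only when both xmin values are zero. */
static int
hs_feedback_is_disable(uint32_t xmin, uint32_t catalog_xmin)
{
	return xmin == 0 && catalog_xmin == 0;
}
```

Note that a zero xmin alone no longer means "feedback disabled": a standby
with no running queries but an active logical slot would send xmin = 0 with a
non-zero catalog_xmin.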

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9985e3e..4fa3ad4 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index f524fc4..5b33c97 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e41c650..3b8f06f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1807,10 +1807,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1820,7 +1821,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ea579a0..d041e92 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7300,7 +7300,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 17584ba..c514b7b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -810,7 +810,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 91d27d0..f454d9d 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2f7e645..f786056 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	elog(DEBUG1, "XXX advancing catalogXmin from %u to %u", ShmemVariableCache->oldestCatalogXmin, oldestCatalogXmin);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e47fd44..73f5fc0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5641,6 +5641,61 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f8ffa5c..4c39a36 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4840,6 +4840,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -4852,6 +4853,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6430,6 +6432,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6446,6 +6451,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8506,6 +8512,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8709,7 +8716,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9072,7 +9079,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9263,6 +9270,16 @@ XLogReportParameters(void)
 			XLogFlush(recptr);
 		}
 
+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);
+
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
@@ -9431,6 +9448,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9529,8 +9547,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 08b0989..6c08739 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2272,7 +2272,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f4afcd9..718ebba 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -993,7 +993,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 0919ad8..3efc833 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2119,11 +2119,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b1be2f7..f68673d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -488,6 +488,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -497,7 +506,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -909,7 +918,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 25020ab..6d49b0e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin(false);
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 61e6a2c..92d3601 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3303,6 +3303,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..5eaf42f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1512be5..9912800 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,244 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master doesn't explicitly reply to hot standby feedback, doesn't
+	 * identify which feedback message it processed most recently, and
+	 * doesn't report the catalog_xmin it reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with standby at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * TODO: get reply from server explicitly confirming that it has applied
+	 * our hs_feedback and what the lowest catalog_xmin it can honour is.
+	 * We'll need some kind of cookie so we can tell the server is replying
+	 * to us not someone else, especially in cascading setups.
+	 */
+
+	firstWaitWalEnd = lastWaitWalEnd = WalRcv->latestWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+		XLogRecPtr ptr = GetXLogReplayRecPtr(NULL);
+
+		elog(DEBUG1, "XXX firstEnd %X/%X, lastEnd %X/%X; ptr %X/%X; oldestCatalogXmin %u",
+			(uint32)(firstWaitWalEnd>>32), (uint32)(firstWaitWalEnd),
+			(uint32)(lastWaitWalEnd>>32), (uint32)(lastWaitWalEnd),
+			(uint32)(ptr>>32), (uint32)(ptr),
+			ShmemVariableCache->oldestCatalogXmin);
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our oldestCatalogXmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has already removed it.
+		 * We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid()
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index cf814d1..6ca2a00 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -795,6 +795,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -842,7 +929,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..43b33ef 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,9 +498,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If logical decoding information is enabled, we also
+						 * send immediate hot standby feedback so as to reduce
+						 * the delay before our needed catalogs are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1164,8 +1170,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	static bool master_has_standby_xmin = false;
 
@@ -1206,29 +1212,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
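
The epoch handling in the feedback message can be sketched on its own. This is a minimal standalone illustration, not code from the patch: the helper names (epoch_for_xid, xid_in_recent_past, xid_precedes_or_equals) are hypothetical, and next_xid/next_epoch are passed in as parameters where the real code calls GetNextXidAndEpoch(). The sender picks an epoch for each xid it reports, and the receiver rejects any xid/epoch pair that is in the future or has already wrapped around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Stand-in for PostgreSQL's FirstNormalTransactionId (value 3). */
static const TransactionId first_normal_xid = 3;

/* Circular (modulo-2^32) "a <= b" for normal xids, plain compare otherwise. */
static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
	if (a < first_normal_xid || b < first_normal_xid)
		return a <= b;
	return (int32_t) (a - b) <= 0;
}

/*
 * Sender side: an xid numerically greater than next_xid must belong to the
 * previous epoch, so report next_epoch - 1 for it.
 */
static uint32_t
epoch_for_xid(TransactionId xid, TransactionId next_xid, uint32_t next_epoch)
{
	/* an xid from "the previous epoch" cannot exist in epoch 0 */
	assert(next_epoch > 0 || xid <= next_xid);
	return (next_xid < xid) ? next_epoch - 1 : next_epoch;
}

/* Receiver side: accept only xids that are neither future nor wrapped. */
static bool
xid_in_recent_past(TransactionId xid, uint32_t epoch,
				   TransactionId next_xid, uint32_t next_epoch)
{
	if (xid <= next_xid)
	{
		if (epoch != next_epoch)
			return false;
	}
	else
	{
		if (epoch + 1 != next_epoch)
			return false;
	}

	/* epoch is plausible, but the xid may still be out of range */
	return xid_precedes_or_equals(xid, next_xid);
}
```

Because xmin and catalog_xmin can sit on different sides of the epoch boundary, each needs its own epoch in the message, which is why the patch sends two xid/epoch pairs rather than sharing one epoch.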
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 46976ce..33e2c1b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,7 +188,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -217,6 +216,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1556,6 +1556,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1618,7 +1623,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1639,6 +1644,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1649,59 +1670,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1726,15 +1780,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2607,17 +2669,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2651,7 +2702,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0f63755..e8b21e4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1291,17 +1291,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1375,9 +1380,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1426,19 +1435,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.
+	 *
+	 * XXX TODO make sure we zero the checkpointed value when we turn logical decoding
+	 * off, and check it during startup!!
+	 */
+	if (!XLogLogicalInfoActive() && !force)
+	{
+		Assert(!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin));
+		Assert(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin));
+	}
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2166,14 +2249,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2655,6 +2744,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2929,18 +3065,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
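
The decision made by CatalogXminNeedsUpdate() in the procarray.c changes can be tried in isolation. The sketch below is a standalone approximation, not code from the patch: xid_precedes() mimics the circular comparison done by TransactionIdPrecedes(), and the names xid_is_valid, xid_precedes and catalog_xmin_needs_update are hypothetical stand-ins. It assumes the usual PostgreSQL values of 0 for the invalid xid and 3 for the first normal xid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Stand-ins for InvalidTransactionId and FirstNormalTransactionId. */
static const TransactionId invalid_xid = 0;
static const TransactionId first_normal_xid = 3;

static bool
xid_is_valid(TransactionId xid)
{
	return xid != invalid_xid;
}

/* Circular "a < b" for normal xids, plain unsigned compare otherwise. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (a < first_normal_xid || b < first_normal_xid)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * True when the value vacuum uses should be refreshed: either the slots'
 * requirement has advanced past it, or exactly one of the two is unset.
 */
static bool
catalog_xmin_needs_update(TransactionId vacuum_catalog_xmin,
						  TransactionId slots_catalog_xmin)
{
	return xid_precedes(vacuum_catalog_xmin, slots_catalog_xmin) ||
		(xid_is_valid(vacuum_catalog_xmin) != xid_is_valid(slots_catalog_xmin));
}
```

Note that when the slots' value is older than the value vacuum already uses, and both are set, the check reports that no update is needed, so the effective value never moves backwards through this path.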
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index a3d6ac5..d17dba1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 112fe07..8e3a3b7 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1108,3 +1111,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock. No
+			 * new conflicts can appear "behind" our scan of the replication_slots
+			 * array because sessions check the oldestCatalogXmin on decoding
+			 * startup. Releasing the lock lets the exiting backend clear the
+			 * slot's active_pid.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/*
+				 * Guard against signalling the wrong process after pid reuse:
+				 * check that the slot's active_pid is unchanged. The unlocked
+				 * read is OK; at worst we sleep again or, with terrible luck,
+				 * cancel a new backend that reused the pid in this slot.
+				 */
+				if (active_pid != slot->active_pid)
+				{
+					LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+					continue;
+				}
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("PID %d requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   (int) active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when the system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b179231..5cf92ac 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2262,6 +2262,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2698,8 +2701,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2781,6 +2788,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2795,12 +2803,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2855,11 +2864,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3bad417 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -242,6 +242,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 969eff9..50f68e8 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index a123d2a..17e4306 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 0bc41ab..df19adc 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -43,6 +43,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 282f8ae..515479d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -745,7 +745,8 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index b653e5c..5638130 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 28dc1fc..1a771e7 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index dd37c0c..0592aff 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -78,6 +78,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -86,6 +88,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(bool force);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index f67b982..8e37e29 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index dcebf72..cc04186 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..3f57230
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,454 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+hot_standby_feedback = on
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery. Logical
+# decoding doesn't keep a virtualxid while waiting for WAL, only when calling
+# output plugins, so the conflict can't be detected through virtualxids; the
+# session is cancelled based on its slot's catalog_xmin instead.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'pg_recvlogical error mentions hot_standby_feedback');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5
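
A note on the is($?, 256, ...) checks in the tests above: Perl's $? holds
the raw wait status of the child, not its exit code, and the exit code is
$? >> 8. So 256 means pg_recvlogical exited with status 1. The same
decoding can be sketched in C as below. This is only an illustration,
assuming a POSIX system; exit_code_of is a hypothetical helper, not part
of the patch or of PostgreSQL.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Run a shell command and return its exit code, or -1 if it could not be
 * run or did not exit normally (for example, it was killed by a signal).
 */
int
exit_code_of(const char *command)
{
	/* system() returns the raw wait status, comparable to Perl's $? */
	int			status = system(command);

	if (status == -1 || !WIFEXITED(status))
		return -1;

	/* Extract the exit code from the wait status */
	return WEXITSTATUS(status);
}
```

For example, exit_code_of("exit 1") returns 1, the case the tests above
see as a wait status of 256.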

#23Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#22)
Re: Logical decoding on standby

On 4 January 2017 at 12:08, Craig Ringer <craig@2ndquadrant.com> wrote:

0001 incorporates the changes requested by Michael Paquier. Simon
expressed his intention to commit this after updates, in the separate
thread [...]

...

0005 (new streaming rep tests) is updated for the changes in 0001,
otherwise unchanged. Simon said he wanted to commit this soon.

0006 (timeline following) is unchanged except for updates to be
compatible with 0001.

0007 is the optional snapshot export requested by Petr.

0008 is unchanged.

0009 is unchanged except for updates vs 0001 and use of the
WITHOUT_SNAPSHOT option added in 0007.

Oh, note that it's possible to commit 0001 then 0005, skipping over
2..4. I should probably have ordered them that way.

That's particularly relevant to you Simon as you expressed a wish to
commit the new streaming rep tests.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#23)
Re: Logical decoding on standby

On 4 January 2017 at 12:15, Craig Ringer <craig@2ndquadrant.com> wrote:

That's particularly relevant to you Simon as you expressed a wish to
commit the new streaming rep tests.

Patches 0001 and 0005 in this series are also posted as
/messages/by-id/CAMsr+YHxTMrY1woH_m4bEF3f5+kxX_T=sDuyXf4d2-+e-56iFg@mail.gmail.com
since they're really prerequisites rather than part of decoding on
standby as such. I'll post a new series with them removed once they're
committed.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#25Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#24)
Re: Logical decoding on standby

On 4 January 2017 at 16:19, Craig Ringer <craig@2ndquadrant.com> wrote:

On 4 January 2017 at 12:15, Craig Ringer <craig@2ndquadrant.com> wrote:

That's particularly relevant to you Simon as you expressed a wish to
commit the new streaming rep tests.

Simon committed 1, 2, 3 and 5:

* Extra PostgresNode methods
* pg_recvlogical --endpos
* Tests for pg_recvlogical
* Expand streaming replication tests to cover hot standby

so here's a rebased series on top of master. No other changes.

The first patch to add a pg_recvlogical wrapper to PostgresNode is
really only needed to test the rest of the patches, so it can be
committed together with them.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#26Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#25)
5 attachment(s)
Re: Logical decoding on standby

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

so here's a rebased series on top of master. No other changes.

Now with actual patches.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From 23c51f2deadfc7ca0bf131f7a16dd35bd2b31847 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 1/6] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 75 ++++++++++++++++++++++++++++-
 src/test/recovery/t/006_logical_decoding.pl | 31 +++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 2a4ceb3..66e01d6 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1124,7 +1124,7 @@ sub psql
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
-			  if (blessed($exc_save) || $exc_save ne $timeout_exception);
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
 
 			$ret = undef;
 
@@ -1493,6 +1493,79 @@ sub slot
 	return $self->query_hash('postgres', "SELECT __COLUMNS__ FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'", @columns);
 }
 
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos if ($endpos);
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
 =pod
 
 =back
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index b80a9a9..d8cc8d3 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -36,5 +40,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5
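
The timeline-following patch that comes next (0002) decides which
timeline to read from by looking at the last byte of the WAL segment
containing the wanted page, not at the page itself. The arithmetic can
be sketched in isolation as below. This is only an illustration: it
assumes the default 16 MB segment size, and SEG_SIZE, end_of_segment and
same_segment are hypothetical names standing in for the server's
XLogSegSize and XLogRecPtr handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed segment size for this sketch: 16 MB, the default. */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Position of the last byte of the segment that contains ptr. */
uint64_t
end_of_segment(uint64_t ptr)
{
	return ((ptr / SEG_SIZE) + 1) * SEG_SIZE - 1;
}

/* True if both positions fall within the same segment. */
bool
same_segment(uint64_t a, uint64_t b)
{
	return a / SEG_SIZE == b / SEG_SIZE;
}
```

The patch asserts the equivalent of same_segment(wantPage, endOfSegment)
before looking up the timeline for that end-of-segment position.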

0002-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From 347daf05fa9e5e18b26c5f359e3f28c634b46315 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 2/6] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 0de2419..cadacb5 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -660,6 +661,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -675,7 +677,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -702,6 +705,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -750,6 +754,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -770,28 +897,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're a logical decoding session on C, and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * it.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d16d6da..6e4935e 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -234,13 +234,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -279,6 +279,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..93c2816 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -760,6 +761,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -992,10 +999,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 00102e8..88197b9 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -160,6 +160,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 * 
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 567a7f3..25a9942 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..142a1b8 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before_basebackup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5
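
A note on the "There's no max(pg_lsn)" comment in the test above: the test gets its end position by sorting on the server. A client that instead collects LSNs as text has to convert them before comparing, since plain string comparison ranks "0/9" above "0/10". The following is only a minimal sketch of that conversion, assuming the usual two-part hexadecimal "hi/lo" text form; parse_lsn and max_lsn are hypothetical helper names and are not part of this patch series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: parse a textual LSN of the
 * form "hi/lo" (two hexadecimal numbers) into a 64-bit position.
 * Returns 1 on success, 0 if the text is not in that form.
 */
int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    assert(text != NULL && lsn != NULL);

    /* The trailing %c converts only if unexpected characters follow. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return 0;

    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}

/*
 * Hypothetical helper: return the greatest LSN among n textual LSNs,
 * skipping entries that do not parse. Returns 0 if none parse.
 */
uint64_t
max_lsn(const char *const *texts, int n)
{
    uint64_t best = 0;

    for (int i = 0; i < n; i++)
    {
        uint64_t cur;

        if (parse_lsn(texts[i], &cur) && cur > best)
            best = cur;
    }
    return best;
}
```

With the inputs "0/9" and "0/10", max_lsn picks the second entry, which is the numerically later position.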

0003-Make-snapshot-export-on-logical-slot-creation-option.patch (text/x-patch; charset=US-ASCII)
From 29ad004eeb735d7255c270fd50fa217f8c9838b3 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 21 Dec 2016 11:21:46 +0800
Subject: [PATCH 3/6] Make snapshot export on logical slot creation optional

Allow logical decoding slot creation via the walsender protocol's
CREATE_REPLICATION_SLOT command to optionally suppress exporting of
a snapshot when the WITHOUT_SNAPSHOT option is passed.

This means that when we allow creation of replication slots on standbys, which
cannot export snapshots, we don't have to silently omit the snapshot creation.
It also allows clients like pg_recvlogical, which neither need nor can use the
exported snapshot, to suppress its creation. Since snapshot exporting can fail,
this improves reliability.
---
 doc/src/sgml/logicaldecoding.sgml      | 13 ++++++++++---
 doc/src/sgml/protocol.sgml             | 17 +++++++++++++++--
 src/backend/replication/repl_gram.y    | 13 ++++++++++---
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    |  9 ++++++++-
 src/bin/pg_basebackup/streamutil.c     |  5 +++++
 src/include/nodes/replnodes.h          |  1 +
 7 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 484915d..c0b6987 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -268,11 +268,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-snapshot-export" xreflabel="Exported Snapshots (Logical Decoding)">
     <title>Exported Snapshots</title>
     <para>
-     When a new replication slot is created using the streaming replication interface,
-     a snapshot is exported
+     When <link linkend="protocol-replication-create-slot">a new replication
+     slot is created using the streaming replication interface</>, a snapshot
+     is exported
      (see <xref linkend="functions-snapshot-synchronization">), which will show
      exactly the state of the database after which all changes will be
      included in the change stream. This can be used to create a new replica by
@@ -282,6 +283,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      database's state at that point in time, which afterwards can be updated
      using the slot's contents without losing any changes.
     </para>
+    <para>
+     Creation of a snapshot is not always possible - in particular, it will
+     fail when connected to a hot standby. Applications that do not require
+     snapshot export may suppress it with the <literal>WITHOUT_SNAPSHOT</>
+     option.
+    </para>
    </sect2>
   </sect1>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 9ba147c..e41c650 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1433,8 +1433,8 @@ The commands accepted in walsender mode are:
     </listitem>
   </varlistentry>
 
-  <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+  <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> [<literal>WITHOUT_SNAPSHOT</>] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1485,6 +1485,19 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>WITHOUT_SNAPSHOT</></term>
+       <listitem>
+        <para>
+         By default, logical replication slot creation exports a snapshot for
+         use in initialization; see <xref linkend="logicaldecoding-snapshot-export">.
+         Because not all clients need an exported snapshot, its creation can
+         be suppressed with <literal>WITHOUT_SNAPSHOT</>.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
      </variablelist>
     </listitem>
   </varlistentry>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d962c76..b2786cc 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -78,6 +78,7 @@ Node *replication_parse_result;
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_WITHOUT_SNAPSHOT
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,7 +91,7 @@ Node *replication_parse_result;
 %type <defelt>	plugin_opt_elem
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
-%type <boolval>	opt_reserve_wal opt_temporary
+%type <boolval>	opt_reserve_wal opt_temporary opt_without_snapshot
 
 %%
 
@@ -194,8 +195,8 @@ create_replication_slot:
 					cmd->reserve_wal = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin WITHOUT_SNAPSHOT */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT opt_without_snapshot
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
@@ -203,6 +204,7 @@ create_replication_slot:
 					cmd->slotname = $2;
 					cmd->temporary = $3;
 					cmd->plugin = $5;
+					cmd->without_snapshot = $6;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -283,6 +285,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_without_snapshot:
+			K_WITHOUT_SNAPSHOT				{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index a3b5f92..6b683aa 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -96,6 +96,7 @@ DROP_REPLICATION_SLOT		{ return K_DROP_REPLICATION_SLOT; }
 TIMELINE_HISTORY	{ return K_TIMELINE_HISTORY; }
 PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
+WITHOUT_SNAPSHOT	{ return K_WITHOUT_SNAPSHOT; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 93c2816..a1d1c0c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -849,7 +849,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * Export a plain (not of the snapbuild.c type) snapshot to the user
 		 * that can be imported into another session.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (cmd->without_snapshot)
+			snapshot_name = "";
+		else if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("cannot export a snapshot from a standby")));
+		else
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 01be3e7..660fefd 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -346,8 +346,13 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
 						  slot_name);
 	else
+	{
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"",
 						  slot_name, plugin);
+		if (PQserverVersion(conn) >= 100000)
+			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
+			appendPQExpBuffer(query, " WITHOUT_SNAPSHOT");
+	}
 
 	res = PQexec(conn, query->data);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index f27354f..0ce21b9 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -57,6 +57,7 @@ typedef struct CreateReplicationSlotCmd
 	char	   *plugin;
 	bool		temporary;
 	bool		reserve_wal;
+	bool		without_snapshot;
 } CreateReplicationSlotCmd;
 
 
-- 
2.5.5
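
To make the client-side behaviour of the streamutil.c hunk above easier to follow, here is a stand-alone sketch of the same decision: append WITHOUT_SNAPSHOT only when the server reports version 100000 or later. It is an illustration under assumptions, not the patch's code. The function name build_create_logical_slot_cmd is hypothetical, it uses snprintf rather than PQExpBuffer, and it does no identifier escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: format a logical CREATE_REPLICATION_SLOT command.
 * WITHOUT_SNAPSHOT is appended only for server_version >= 100000,
 * mirroring the version check in the hunk above.
 *
 * Returns the command length, or -1 if a name is empty or buf is too
 * small. Names are copied verbatim; no quoting or escaping is done.
 */
int
build_create_logical_slot_cmd(char *buf, size_t buflen,
                              const char *slot_name, const char *plugin,
                              int server_version)
{
    int n;

    assert(buf != NULL && slot_name != NULL && plugin != NULL);

    if (strlen(slot_name) == 0 || strlen(plugin) == 0)
        return -1;

    n = snprintf(buf, buflen,
                 "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s",
                 slot_name, plugin,
                 server_version >= 100000 ? " WITHOUT_SNAPSHOT" : "");

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```

Older servers never see the new keyword, so the same client binary can still create slots on them.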

0004-ERROR-if-timeline-is-zero-in-walsender.patch (text/x-patch; charset=US-ASCII)
From 3035b3252bf36f5497d09ac6387095c2779d6ed4 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 13:50:52 +0800
Subject: [PATCH 4/6] ERROR if timeline is zero in walsender

---
 src/backend/replication/walsender.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a1d1c0c..30d01e3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -523,6 +523,11 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
+	if (ThisTimeLineID == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("run IDENTIFY_SYSTEM before trying to START_REPLICATION")));
+
 	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
-- 
2.5.5

0005-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 1d0cd03744353587f5e25038fcaab554a7077cb4 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 5/6] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts. Make walsender
  exit with a recovery conflict on upstream DROP DATABASE when it has an
  active logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin and to be called with the lock
  already held.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in
  hot_standby_feedback.
* Separate the catalog_xmin used by vacuum from ProcArray's
  replication_slot_catalog_xmin, requiring that xlog be emitted before
  vacuum can remove catalog tuples that are no longer needed. Store it in
  checkpoints, and make vacuum and bgwriter advance it.
* During decoding startup, check whether the catalog_xmin requirement can
  be satisfied and bail out if it cannot.
* Add a new recovery conflict type for conflict with catalog_xmin. Abort
  in-progress logical decoding sessions with a recovery conflict where the
  needed catalog_xmin is too old.
* Make extra efforts to reserve master's catalog_xmin during decoding startup
  on standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove checks preventing starting logical decoding on standby.
---
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  15 +
 src/backend/access/transam/xact.c                  |  55 +++
 src/backend/access/transam/xlog.c                  |  26 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |   9 +
 src/backend/postmaster/pgstat.c                    |   2 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 323 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 135 ++++--
 src/backend/storage/ipc/procarray.c                | 201 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 454 +++++++++++++++++++++
 35 files changed, 1548 insertions(+), 129 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5580637..34898f6 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 8db1e20..743cbee 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e41c650..3b8f06f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1807,10 +1807,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1820,7 +1821,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1ce42ea..a32c889 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7300,7 +7300,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 90ab6f2..7bf6fd1 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -810,7 +810,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index c91ca03..e1489c1 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fc084c5..60a6319 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	elog(DEBUG1, "XXX advancing catalogXmin from %u to %u", ShmemVariableCache->oldestCatalogXmin, oldestCatalogXmin);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f5346f0..aed4dea 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5641,6 +5641,61 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70edafa..da85d69 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4840,6 +4840,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -4852,6 +4853,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6430,6 +6432,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6446,6 +6451,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8506,6 +8512,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8709,7 +8716,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9072,7 +9079,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9263,6 +9270,16 @@ XLogReportParameters(void)
 			XLogFlush(recptr);
 		}
 
+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);
+
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
@@ -9431,6 +9448,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9529,8 +9547,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cac0cbf..c7ba896 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2272,7 +2272,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e3e1a53..d48d0b1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -993,7 +993,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2833f3e..02af7b4 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2119,11 +2119,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0f72c1c..4ad3793 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -488,6 +488,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -497,7 +506,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -909,7 +918,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4081982..a209c43 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin(false);
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f37a0bf..5fdb552 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3303,6 +3303,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..07a120d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..10b6be0 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,244 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master does not explicitly reply to hot standby feedback, does
+	 * not identify which feedback message is the most recent, and does not
+	 * report the catalog_xmin it reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with standby at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * TODO: get reply from server explicitly confirming that it has applied
+	 * our hs_feedback and what the lowest catalog_xmin it can honour is.
+	 * We'll need some kind of cookie so we can tell the server is replying
+	 * to us not someone else, especially in cascading setups.
+	 */
+
+	firstWaitWalEnd = lastWaitWalEnd = WalRcv->latestWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+		XLogRecPtr ptr = GetXLogReplayRecPtr(NULL);
+
+		elog(DEBUG1, "XXX firstEnd %X/%X, lastEnd %X/%X; ptr %X/%X; oldestCatalogXmin %u",
+			(uint32)(firstWaitWalEnd>>32), (uint32)(firstWaitWalEnd),
+			(uint32)(lastWaitWalEnd>>32), (uint32)(lastWaitWalEnd),
+			(uint32)(ptr>>32), (uint32)(ptr),
+			ShmemVariableCache->oldestCatalogXmin);
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our oldestCatalogXmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has already
+		 * removed it. We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 10d69d0..f4d4e39 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -795,6 +795,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -842,7 +929,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c6b54ec..a3156da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,9 +498,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If logical decoding information is enabled, we also
+						 * send immediate hot standby feedback so as to reduce
+						 * the delay before our needed catalogs are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1164,8 +1170,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	static bool master_has_standby_xmin = false;
 
@@ -1206,29 +1212,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 30d01e3..4bcd1d7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,7 +188,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -217,6 +216,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1556,6 +1556,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1618,7 +1623,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1639,6 +1644,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1649,59 +1670,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1726,15 +1780,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2607,17 +2669,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2651,7 +2702,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 83b0c71..9794cf0 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1291,17 +1291,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1375,9 +1380,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1426,19 +1435,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.
+	 *
+	 * XXX TODO make sure we zero the checkpointed value when we turn logical decoding
+	 * off, and check it during startup!!
+	 */
+	if (!XLogLogicalInfoActive() && !force)
+	{
+		Assert(!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin));
+		Assert(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin));
+	}
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2166,14 +2249,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on a replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2655,6 +2744,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+			
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2929,18 +3065,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 9cc1281..49f2082b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1108,3 +1111,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock. No
+			 * new conflicts can appear "behind" our scan of the replication_slots
+			 * array because sessions check the oldestCatalogXmin on decoding
+			 * startup. This lets the exiting backend clear the slot's
+			 * active_pid.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/* 
+				 * As a safeguard against signalling the wrong process in case of
+				 * pid reassignment, check that the slot's active_pid hasn't been
+				 * cleared or changed. Do an unlocked read here since the worst
+				 * wrong outcome even in the case of garbage read is an extra
+				 * sleep. If you get a new backend with the same pid in the
+				 * same slot array position you have terrible luck, and it
+				 * might get cancelled with a spurious conflict. 
+				 */
+				if (active_pid != slot->active_pid)
+					continue;
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("Pid %u requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+			
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 05b2e57..0f00c10 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2262,6 +2262,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2698,8 +2701,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2781,6 +2788,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2795,12 +2803,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2855,11 +2864,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3bad417 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -242,6 +242,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 522c104..441edbe 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 4df6529..8165e19 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 23731e9..6415df3 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -43,6 +43,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b37894..b2c78ca 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -745,7 +745,8 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a83762d..943549b 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 70b3b9d..b1a4496 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 0d5027f..4e3bc70 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -78,6 +78,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -86,6 +88,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(bool force);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..74713f9 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..3f57230
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,454 @@
+# Demonstrate that logical decoding works on a physical standby.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+hot_standby_feedback = on
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery, but only
+# if there's an active xact when decoding. Logical decoding
+# doesn't keep a virtualxid while waiting for WAL, only when calling output
+# plugins, so this won't work.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#27Michael Paquier
michael.paquier@gmail.com
In reply to: Craig Ringer (#26)
Re: Logical decoding on standby

On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

so here's a rebased series on top of master. No other changes.

Now with actual patches.

Looking at the PostgresNode code in 0001...
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos,
timeout_secs, ...)
+
This format is incorrect. I think that you also need to fix the
brackets for the do{} and the eval{} blocks.
+    push @cmd, '--endpos', $endpos if ($endpos);
endpos should be made a mandatory argument. If it is not defined, the
test calling this routine would get stuck.
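
A minimal sketch of the mandatory-argument idea, for illustration only. The
helper name build_recvlogical_cmd, the connection string and the LSN value
are hypothetical; the pg_recvlogical options are the ones already used in
this thread. The point is that a missing endpos fails immediately rather
than leaving pg_recvlogical streaming with no stopping point.

```perl
use strict;
use warnings;

# Hypothetical helper, not the patch's code: builds the pg_recvlogical
# argument list and refuses to proceed without an end position.
sub build_recvlogical_cmd
{
	my ($connstr, $slot_name, $endpos) = @_;

	die 'slot name must be specified' unless defined($slot_name);
	die 'endpos must be specified'    unless defined($endpos);

	return ('pg_recvlogical', '-S', $slot_name, '--dbname', $connstr,
		'--endpos', $endpos, '-f', '-', '--no-loop', '--start');
}

# Normal use: every required argument is supplied.
my @cmd = build_recvlogical_cmd('dbname=postgres', 'test_slot', '0/16B6C50');
print join(' ', @cmd), "\n";

# Omitting endpos raises an error up front instead of hanging later.
eval { build_recvlogical_cmd('dbname=postgres', 'test_slot') };
print "refused: $@" if $@;
```

Using defined() rather than a truth test means the check rejects only a
missing argument, not a value that happens to be false in Perl.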
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Craig Ringer
craig@2ndquadrant.com
In reply to: Michael Paquier (#27)
Re: Logical decoding on standby

On 5 January 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote:

On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

so here's a rebased series on top of master. No other changes.

Now with actual patches.

Looking at the PostgresNode code in 0001...
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos,
timeout_secs, ...)
+
This format is incorrect. I think that you also need to fix the
brackets for the do{} and the eval{} blocks.
+    push @cmd, '--endpos', $endpos if ($endpos);
endpos should be made a mandatory argument. If it is not defined that
would make the test calling this routine being stuck.
--
Michael

Acknowledged and agreed. I'll fix both in the next revision. I'm
currently working on hot standby replies, but will return to this
ASAP.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#29Michael Paquier
michael.paquier@gmail.com
In reply to: Craig Ringer (#28)
Re: Logical decoding on standby

On Fri, Jan 6, 2017 at 1:07 PM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 January 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote:

On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

so here's a rebased series on top of master. No other changes.

Now with actual patches.

Looking at the PostgresNode code in 0001...
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos,
timeout_secs, ...)
+
This format is incorrect. I think that you also need to fix the
brackets for the do{} and the eval{} blocks.

+ push @cmd, '--endpos', $endpos if ($endpos);
endpos should be made a mandatory argument. If it is not defined that
would make the test calling this routine being stuck.

Acknowledged and agreed. I'll fix both in the next revision. I'm
currently working on hot standby replies, but will return to this
ASAP.

By the way, be sure to fix as well the =pod blocks for the new
routines. perldoc needs to use both =pod and =item.
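
A minimal sketch of that layout, assuming the usual arrangement where the
module opens a list with =over and closes it with =back. The method name
example_method is hypothetical. Each documented routine gets its own =pod
block containing an =item line, and =cut returns the parser to code.

```perl
use strict;
use warnings;

=pod

=over

=item $node->example_method(arg)

Return arg unchanged. Exists only to show where the POD directives go.

=cut

sub example_method
{
	my ($self, $arg) = @_;
	return $arg;
}

=pod

=back

=cut

print example_method(undef, 'ok'), "\n";
```

Running perldoc on a file laid out this way should render the =item line as
a list entry followed by its description.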
--
Michael


#30Thom Brown
thom@linux.com
In reply to: Craig Ringer (#26)
Re: Logical decoding on standby

On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

so here's a rebased series on top of master. No other changes.

Now with actual patches.

Patch 5 no longer applies:

patching file src/include/pgstat.h
Hunk #1 FAILED at 745.
1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej

--- src/include/pgstat.h
+++ src/include/pgstat.h
@@ -745,7 +745,8 @@ typedef enum
        WAIT_EVENT_SYSLOGGER_MAIN,
        WAIT_EVENT_WAL_RECEIVER_MAIN,
        WAIT_EVENT_WAL_SENDER_MAIN,
-       WAIT_EVENT_WAL_WRITER_MAIN
+       WAIT_EVENT_WAL_WRITER_MAIN,
+       WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;

/* ----------

Could you rebase?

Thanks

Thom


#31Craig Ringer
craig@2ndquadrant.com
In reply to: Thom Brown (#30)
5 attachment(s)
Re: Logical decoding on standby

Rebased series attached, on top of current master (which includes
logical replication).

I'm inclined to think I should split out a few of the changes from
0005, roughly along the lines of the bullet points in its commit
message. Anyone feel strongly about how granular this should be?

This patch series is a pre-requisite for supporting logical
replication using a physical standby as a source, but does not
itself enable logical replication from a physical standby.

On 23 January 2017 at 23:03, Thom Brown <thom@linux.com> wrote:

On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

so here's a rebased series on top of master. No other changes.

Now with actual patches.

Patch 5 no longer applies:

patching file src/include/pgstat.h
Hunk #1 FAILED at 745.
1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej

--- src/include/pgstat.h
+++ src/include/pgstat.h
@@ -745,7 +745,8 @@ typedef enum
        WAIT_EVENT_SYSLOGGER_MAIN,
        WAIT_EVENT_WAL_RECEIVER_MAIN,
        WAIT_EVENT_WAL_SENDER_MAIN,
-       WAIT_EVENT_WAL_WRITER_MAIN
+       WAIT_EVENT_WAL_WRITER_MAIN,
+       WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;

/* ----------

Could you rebase?

Thanks

Thom

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From c77603f9359d0684ee0657d0f3f686db0e3918d4 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 1/6] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 80 ++++++++++++++++++++++++++++-
 src/test/recovery/t/006_logical_decoding.pl | 31 ++++++++++-
 2 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 18d5d12..04485c2 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1139,7 +1139,7 @@ sub psql
 			# IPC::Run::run threw an exception. re-throw unless it's a
 			# timeout, which we'll handle by testing is_expired
 			die $exc_save
-			  if (blessed($exc_save) || $exc_save ne $timeout_exception);
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
 
 			$ret = undef;
 
@@ -1520,6 +1520,84 @@ sub slot
 
 =pod
 
+=item $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	die 'slot name must be specified' unless defined($slot_name);
+	die 'endpos must be specified' unless defined($endpos);
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos;
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 1716360..3f249cd 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -35,5 +39,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5
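
The test above passes an LSN in its textual form ("hi/lo" in hexadecimal) as the end position for pg_recvlogical. As a rough illustration of what comparing against such an end position involves, here is a minimal standalone sketch. The names lsn_t, parse_lsn and reached_endpos are hypothetical and are not taken from the patch, and the exact boundary handling inside pg_recvlogical is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 64-bit WAL position. */
typedef uint64_t lsn_t;

/*
 * Parse the textual "hi/lo" hexadecimal LSN form, e.g. "0/16B3748".
 * Returns false if the text is not exactly in that form.
 */
static bool
parse_lsn(const char *text, lsn_t *result)
{
	unsigned int hi;
	unsigned int lo;
	char		junk;

	assert(text != NULL && result != NULL);

	/* The trailing %c only matches if unexpected characters follow. */
	if (sscanf(text, "%X/%X%c", &hi, &lo, &junk) != 2)
		return false;

	*result = ((lsn_t) hi << 32) | lo;
	return true;
}

/* True once a position is at or past the requested end position. */
static bool
reached_endpos(lsn_t current, lsn_t endpos)
{
	return current >= endpos;
}
```

Because the high and low halves are combined into one 64-bit value, ordinary integer comparison gives the right ordering, which is why the test can simply take the largest location reported by the slot.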

0002-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From ae2e9e1c6788cb4018464074561e474d82294053 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 2/6] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
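
To illustrate the split-record case described above, here is a minimal standalone sketch of the segment arithmetic. It assumes 16MB segments and 8kB pages and ignores page headers; the names WAL_SEG_SIZE, WAL_PAGE_SIZE, segment_of, page_start and record_crosses_segment are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for illustration only. */
#define WAL_SEG_SIZE	(UINT64_C(16) * 1024 * 1024)
#define WAL_PAGE_SIZE	UINT64_C(8192)

/* Hypothetical stand-in for a 64-bit WAL position. */
typedef uint64_t lsn_t;

/* Segment number containing a WAL position. */
static uint64_t
segment_of(lsn_t ptr)
{
	return ptr / WAL_SEG_SIZE;
}

/* Start address of the page containing a WAL position. */
static lsn_t
page_start(lsn_t ptr)
{
	return ptr - (ptr % WAL_PAGE_SIZE);
}

/*
 * Does a record of "len" bytes starting at "start" continue into the
 * next segment? Page headers are ignored to keep the sketch short.
 */
static bool
record_crosses_segment(lsn_t start, uint64_t len)
{
	assert(len > 0);
	return segment_of(start) != segment_of(start + len - 1);
}
```

A record for which record_crosses_segment() is true has its first page on one segment and its last page on the next, so the timeline must be decided per page rather than once per record.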
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 0de2419..cadacb5 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -660,6 +661,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -675,7 +677,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -702,6 +705,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -750,6 +754,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read the whole segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -770,28 +897,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're a logical decoding session on C, and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received WAL past the end
+			 * of the historical one.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 41c5000..0dfcdac 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -235,13 +235,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -280,6 +280,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..93c2816 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -760,6 +761,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -992,10 +999,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 00102e8..88197b9 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -160,6 +160,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 *
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 567a7f3..25a9942 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..142a1b8 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before_basebackup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to read the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5
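
The eager switch that XLogReadDetermineTimeline performs can be modelled outside the server. The following is a minimal sketch under stated assumptions: 16MB segments, a history array ordered newest first, and 0 standing in for InvalidXLogRecPtr. The names tl_entry, timeline_of_point and timeline_for_page are hypothetical; they only loosely mirror what the patch does with tliOfPointInHistory and tliSwitchPoint and are not the server's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed segment size for illustration only. */
#define TL_SEG_SIZE (UINT64_C(16) * 1024 * 1024)

/* Hypothetical stand-in for a 64-bit WAL position; 0 means "invalid". */
typedef uint64_t lsn_t;

/* One line of a simplified timeline history, newest entry first. */
typedef struct
{
	uint32_t	tli;
	lsn_t		begin;	/* inclusive; 0 means "from the start of WAL" */
	lsn_t		end;	/* exclusive switch point; 0 means "still current" */
} tl_entry;

/* Find the timeline that owns a given WAL position, or 0 if none does. */
static uint32_t
timeline_of_point(lsn_t ptr, const tl_entry *hist, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if ((hist[i].begin == 0 || hist[i].begin <= ptr) &&
			(hist[i].end == 0 || ptr < hist[i].end))
			return hist[i].tli;
	}
	return 0;
}

/*
 * Pick the timeline to read a page from: the timeline of the LAST byte
 * of the segment containing the page. Also report where that timeline
 * stops being valid (0 if it is the current timeline).
 */
static uint32_t
timeline_for_page(lsn_t want_page, const tl_entry *hist, size_t n,
				  lsn_t *valid_until)
{
	lsn_t		end_of_segment = (want_page / TL_SEG_SIZE + 1) * TL_SEG_SIZE - 1;
	uint32_t	tli = timeline_of_point(end_of_segment, hist, n);

	assert(valid_until != NULL);

	*valid_until = 0;
	for (size_t i = 0; i < n; i++)
	{
		if (hist[i].tli == tli)
			*valid_until = hist[i].end;
	}
	return tli;
}
```

With a switch from timeline 1 to timeline 2 in the middle of a segment, a page at the very start of that segment is already read from timeline 2, even though the page itself predates the switch. That matches the reasoning in the patch: the new timeline's copy of the segment contains the old data, while the old segment may be gone or renamed.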

0003-Make-snapshot-export-on-logical-slot-creation-option.patch (text/x-patch; charset=US-ASCII)
From b42e4622746d7ce9515bb3fb90fa0e875ecde21a Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 21 Dec 2016 11:21:46 +0800
Subject: [PATCH 3/6] Make snapshot export on logical slot creation optional

Allow logical decoding slot creation via the walsender protocol's
CREATE_REPLICATION_SLOT command to optionally suppress exporting of
a snapshot when the WITHOUT_SNAPSHOT option is passed.

This means that when we allow creation of replication slots on standbys, which
cannot export snapshots, we don't have to silently omit the snapshot creation.
It also allows clients like pg_recvlogical, which neither need nor can use the
exported snapshot, to suppress its creation. Since snapshot exporting can fail
this improves reliability.
---
 doc/src/sgml/logicaldecoding.sgml      | 13 ++++++++++---
 doc/src/sgml/protocol.sgml             | 17 +++++++++++++++--
 src/backend/replication/repl_gram.y    | 13 ++++++++++---
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    |  9 ++++++++-
 src/bin/pg_basebackup/streamutil.c     |  5 +++++
 src/include/nodes/replnodes.h          |  1 +
 7 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 484915d..c0b6987 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -268,11 +268,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-snapshot-export" xreflabel="Exported Snapshots (Logical Decoding)">
     <title>Exported Snapshots</title>
     <para>
-     When a new replication slot is created using the streaming replication interface,
-     a snapshot is exported
+     When <link linkend="protocol-replication-create-slot">a new replication
+     slot is created using the streaming replication interface</>, a snapshot
+     is exported
      (see <xref linkend="functions-snapshot-synchronization">), which will show
      exactly the state of the database after which all changes will be
      included in the change stream. This can be used to create a new replica by
@@ -282,6 +283,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      database's state at that point in time, which afterwards can be updated
      using the slot's contents without losing any changes.
     </para>
+    <para>
+     Creation of a snapshot is not always possible - in particular, it will
+     fail when connected to a hot standby. Applications that do not require
+     snapshot export may suppress it with the <literal>WITHOUT_SNAPSHOT</>
+     option.
+    </para>
    </sect2>
   </sect1>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5f89db5..a5b156c 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1433,8 +1433,8 @@ The commands accepted in walsender mode are:
     </listitem>
   </varlistentry>
 
-  <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+  <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> [<literal>WITHOUT_SNAPSHOT</>] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1485,6 +1485,19 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>WITHOUT_SNAPSHOT</></term>
+       <listitem>
+        <para>
+         By default, logical replication slot creation exports a snapshot for
+         use in initialization; see <xref linkend="logicaldecoding-snapshot-export">.
+         Because not all clients need an exported snapshot, its creation can
+         be suppressed with <literal>WITHOUT_SNAPSHOT</>.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
      </variablelist>
     </listitem>
   </varlistentry>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d962c76..b2786cc 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -78,6 +78,7 @@ Node *replication_parse_result;
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_WITHOUT_SNAPSHOT
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,7 +91,7 @@ Node *replication_parse_result;
 %type <defelt>	plugin_opt_elem
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
-%type <boolval>	opt_reserve_wal opt_temporary
+%type <boolval>	opt_reserve_wal opt_temporary opt_without_snapshot
 
 %%
 
@@ -194,8 +195,8 @@ create_replication_slot:
 					cmd->reserve_wal = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin WITHOUT_SNAPSHOT */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT opt_without_snapshot
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
@@ -203,6 +204,7 @@ create_replication_slot:
 					cmd->slotname = $2;
 					cmd->temporary = $3;
 					cmd->plugin = $5;
+					cmd->without_snapshot = $6;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -283,6 +285,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_without_snapshot:
+			K_WITHOUT_SNAPSHOT				{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index a3b5f92..6b683aa 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -96,6 +96,7 @@ DROP_REPLICATION_SLOT		{ return K_DROP_REPLICATION_SLOT; }
 TIMELINE_HISTORY	{ return K_TIMELINE_HISTORY; }
 PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
+WITHOUT_SNAPSHOT	{ return K_WITHOUT_SNAPSHOT; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 93c2816..a1d1c0c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -849,7 +849,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * Export a plain (not of the snapbuild.c type) snapshot to the user
 		 * that can be imported into another session.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (cmd->without_snapshot)
+			snapshot_name = "";
+		else if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("cannot export a snapshot from a standby")));
+		else
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 31290d3..3172424d 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -345,8 +345,13 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
 						  slot_name);
 	else
+	{
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"",
 						  slot_name, plugin);
+		if (PQserverVersion(conn) >= 100000)
+			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
+			appendPQExpBuffer(query, " WITHOUT_SNAPSHOT");
+	}
 
 	res = PQexec(conn, query->data);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index f27354f..0ce21b9 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -57,6 +57,7 @@ typedef struct CreateReplicationSlotCmd
 	char	   *plugin;
 	bool		temporary;
 	bool		reserve_wal;
+	bool		without_snapshot;
 } CreateReplicationSlotCmd;
 
 
-- 
2.5.5
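
The client-side change in streamutil.c amounts to appending the new keyword only when the server is new enough to understand it. Here is a minimal standalone sketch of that command construction without libpq. The function name build_create_slot_cmd is hypothetical, the WITHOUT_SNAPSHOT keyword is the syntax proposed by this patch, and the sketch assumes slot and plugin names need no escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a CREATE_REPLICATION_SLOT command into buf.
 *
 * plugin == NULL requests a physical slot. For logical slots the
 * WITHOUT_SNAPSHOT keyword proposed in this patch is appended only when
 * server_version is at least 100000, since older servers would reject it.
 *
 * Returns the command length, or -1 if buf is too small.
 * Assumption: names contain no characters that need escaping.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
					  const char *plugin, int server_version)
{
	int			n;

	assert(buf != NULL && slot_name != NULL);

	if (plugin == NULL)
		n = snprintf(buf, buflen,
					 "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
					 slot_name);
	else
		n = snprintf(buf, buflen,
					 "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s",
					 slot_name, plugin,
					 server_version >= 100000 ? " WITHOUT_SNAPSHOT" : "");

	/* snprintf reports the length it wanted; treat truncation as failure. */
	return (n >= 0 && (size_t) n < buflen) ? n : -1;
}
```

Gating on the server version keeps a new pg_recvlogical usable against older servers, at the cost of still triggering snapshot export there.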

0004-ERROR-if-timeline-is-zero-in-walsender.patch (text/x-patch; charset=US-ASCII)
From d94740047be230f9a329431e5887b1b614e0ffe2 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 1 Jun 2016 13:50:52 +0800
Subject: [PATCH 4/6] ERROR if timeline is zero in walsender

---
 src/backend/replication/walsender.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a1d1c0c..30d01e3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -523,6 +523,11 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
+	if (ThisTimeLineID == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("run IDENTIFY_SYSTEM before trying to START_REPLICATION")));
+
 	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
-- 
2.5.5
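
The guard added by the walsender.c hunk above amounts to a small
precondition check on session state. A minimal standalone sketch, assuming
(as the error message suggests) that the timeline is expected to have been
established by IDENTIFY_SYSTEM before START_REPLICATION. SketchSession and
both functions are hypothetical names for illustration; the real code tests
the ThisTimeLineID global and raises the error with ereport().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the walsender's per-session state. */
typedef struct SketchSession
{
	uint32_t	this_timeline_id;	/* 0 means "not determined yet" */
} SketchSession;

/* Models IDENTIFY_SYSTEM recording the timeline supplied by the caller. */
void
sketch_identify_system(SketchSession *s, uint32_t current_tli)
{
	assert(s != NULL && current_tli != 0);
	s->this_timeline_id = current_tli;
}

/* Returns NULL on success, or the error text when the guard trips. */
const char *
sketch_start_replication(const SketchSession *s)
{
	assert(s != NULL);
	if (s->this_timeline_id == 0)
		return "run IDENTIFY_SYSTEM before trying to START_REPLICATION";
	return NULL;
}
```

Failing early with a clear message is easier to diagnose than streaming
from timeline zero.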

Attachment: 0005-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 62735bb1e9215a7cb322df9ffe8b272380e86293 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 5/6] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, make walsender
  exit with recovery conflict on upstream drop database when it has an active
  logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin, be called already locked.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in hot_standby_feedback.
* Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
  requiring that xlog be emitted before vacuum can remove catalog tuples that are no
  longer needed. Store it in checkpoints, and make vacuum and bgwriter advance it.
* During decoding startup, check whether the catalog_xmin requirement can be satisfied
  and bail out if it cannot.
* Add a new recovery conflict type for conflict with catalog_xmin. Abort
  in-progress logical decoding sessions with a recovery conflict when the
  catalog_xmin they need is too old.
* Make extra efforts to reserve master's catalog_xmin during decoding startup
  on standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove checks preventing starting logical decoding on standby
---
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  15 +
 src/backend/access/transam/xact.c                  |  55 +++
 src/backend/access/transam/xlog.c                  |  26 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |   9 +
 src/backend/postmaster/pgstat.c                    |   3 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 323 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 135 ++++--
 src/backend/storage/ipc/procarray.c                | 201 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 454 +++++++++++++++++++++
 35 files changed, 1549 insertions(+), 129 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5580637..34898f6 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 8db1e20..743cbee 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a5b156c..69496b4 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1807,10 +1807,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1820,7 +1821,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1ce42ea..a32c889 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7300,7 +7300,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 90ab6f2..7bf6fd1 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -810,7 +810,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index c91ca03..e1489c1 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fc084c5..60a6319 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	elog(DEBUG1, "XXX advancing catalogXmin from %u to %u", ShmemVariableCache->oldestCatalogXmin, oldestCatalogXmin);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f6f136d..38629f7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5643,6 +5643,61 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2f5d603..c205558 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4866,6 +4866,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -4878,6 +4879,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6456,6 +6458,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6472,6 +6477,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8532,6 +8538,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8735,7 +8742,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9098,7 +9105,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9289,6 +9296,16 @@ XLogReportParameters(void)
 			XLogFlush(recptr);
 		}
 
+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);
+
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
@@ -9457,6 +9474,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9555,8 +9573,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 26cbc0e..c01ff5e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2271,7 +2271,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e3e1a53..d48d0b1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -993,7 +993,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6ad8fd7..7ba856a 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2136,11 +2136,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 812fb4a..e07ead8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -488,6 +488,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -497,7 +506,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -909,7 +918,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4081982..a209c43 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin(false);
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..d248b2b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3309,6 +3309,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_LOGICAL_APPLY_MAIN:
 			event_name = "LogicalApplyMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..07a120d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog_xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with recovery error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..10b6be0 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,244 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master doesn't reply to hot standby feedback explicitly or
+	 * identify which message is the most recent, nor does it report
+	 * the catalog_xmin reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with recovery at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * TODO: get reply from server explicitly confirming that it has applied
+	 * our hs_feedback and what the lowest catalog_xmin it can honour is.
+	 * We'll need some kind of cookie so we can tell the server is replying
+	 * to us not someone else, especially in cascading setups.
+	 */
+
+	firstWaitWalEnd = lastWaitWalEnd = WalRcv->latestWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+		XLogRecPtr ptr = GetXLogReplayRecPtr(NULL);
+
+		elog(DEBUG1, "XXX firstEnd %X/%X, lastEnd %X/%X; ptr %X/%X; oldestCatalogXmin %u",
+			(uint32)(firstWaitWalEnd>>32), (uint32)(firstWaitWalEnd),
+			(uint32)(lastWaitWalEnd>>32), (uint32)(lastWaitWalEnd),
+			(uint32)(ptr>>32), (uint32)(ptr),
+			ShmemVariableCache->oldestCatalogXmin);
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our slot's catalog_xmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has already
+		 * removed it. We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
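
Review note: the retention test in EnsureActiveLogicalSlotValid() can be
modelled outside the server. The following is only a minimal standalone
sketch, assuming 32-bit xids with 0 as the invalid value and 3 as the first
normal xid. The helper names xid_is_valid, xid_follows and
slot_catalogs_retained are hypothetical and are not part of the patch;
xid_follows merely imitates the modulo-2^32 comparison that
TransactionIdFollows() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Hypothetical stand-in for TransactionIdIsValid() */
static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

/*
 * Hypothetical stand-in for TransactionIdFollows(): special xids compare
 * as plain unsigned values, normal xids compare modulo 2^32.
 */
static bool
xid_follows(TransactionId a, TransactionId b)
{
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/*
 * True if the catalogs a slot needs are still retained, i.e. the
 * oldestCatalogXmin replayed from the master is set and has not moved
 * past the slot's catalog_xmin.
 */
static bool
slot_catalogs_retained(TransactionId oldest_catalog_xmin,
					   TransactionId slot_catalog_xmin)
{
	if (!xid_is_valid(oldest_catalog_xmin))
		return false;			/* master is not reserving catalogs at all */
	return !xid_follows(oldest_catalog_xmin, slot_catalog_xmin);
}
```

In this model a slot whose catalog_xmin equals the replayed threshold is
still usable; only a strictly newer threshold, or an unset one, makes the
check fail.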
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 10d69d0..f4d4e39 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -795,6 +795,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -842,7 +929,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 0e4a4b9..6f98f16 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -508,9 +508,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If logical decoding information is enabled, we also
+						 * send immediate hot standby feedback so as to reduce
+						 * the delay before our needed catalogs are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1174,8 +1180,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	static bool master_has_standby_xmin = false;
 
@@ -1216,29 +1222,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+		
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
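
Review note: with this change the hot standby feedback message carries two
(xid, epoch) pairs instead of one. The sketch below is a standalone model of
the layout that the pq_send* calls above produce: a type byte 'h', an 8-byte
send time, then four 4-byte integers in network byte order. The names put_u32,
put_u64 and build_hs_feedback are hypothetical helpers for illustration, not
server functions, and the fixed 25-byte length is simply the sum of those
fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define HS_FEEDBACK_LEN 25		/* 1 + 8 + 4 * 4 bytes */

/* Append a 32-bit value in network byte order; returns the new offset. */
static size_t
put_u32(uint8_t *buf, size_t off, uint32_t v)
{
	buf[off] = (uint8_t) (v >> 24);
	buf[off + 1] = (uint8_t) (v >> 16);
	buf[off + 2] = (uint8_t) (v >> 8);
	buf[off + 3] = (uint8_t) v;
	return off + 4;
}

/* Append a 64-bit value in network byte order; returns the new offset. */
static size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	off = put_u32(buf, off, (uint32_t) (v >> 32));
	return put_u32(buf, off, (uint32_t) v);
}

/*
 * Build the feedback message into buf (at least HS_FEEDBACK_LEN bytes).
 * Each xid gets the epoch of next_xid, decremented when the xid is
 * numerically above next_xid and so belongs to the previous epoch.
 */
static size_t
build_hs_feedback(uint8_t *buf, int64_t send_time,
				  TransactionId xmin, TransactionId catalog_xmin,
				  TransactionId next_xid, uint32_t next_epoch)
{
	uint32_t	xmin_epoch = next_epoch;
	uint32_t	catalog_xmin_epoch = next_epoch;
	size_t		off = 0;

	if (next_xid < xmin)
		xmin_epoch--;
	if (next_xid < catalog_xmin)
		catalog_xmin_epoch--;

	buf[off++] = 'h';
	off = put_u64(buf, off, (uint64_t) send_time);
	off = put_u32(buf, off, xmin);
	off = put_u32(buf, off, xmin_epoch);
	off = put_u32(buf, off, catalog_xmin);
	off = put_u32(buf, off, catalog_xmin_epoch);
	return off;
}
```

Because the two epochs are adjusted independently, an xmin just below nextXid
and a catalog_xmin from before the last wraparound can travel in the same
message with different epochs.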
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 30d01e3..4bcd1d7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,7 +188,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -217,6 +216,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1556,6 +1556,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, as LogicalConfirmReceivedLocation does. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1618,7 +1623,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1639,6 +1644,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1649,59 +1670,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1726,15 +1780,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2607,17 +2669,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2651,7 +2702,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
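
Review note: the sanity check factored out into TransactionIdInRecentPast()
is now applied to both feedback xids. A minimal standalone sketch of the same
decision follows, assuming 32-bit xids where values below 3 are special. The
names xid_precedes_or_equals and xid_in_recent_past are hypothetical, and
next_xid / next_epoch are passed in as parameters here rather than read via
GetNextXidAndEpoch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId	((TransactionId) 3)

/* Hypothetical stand-in for TransactionIdPrecedesOrEquals() */
static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a <= b;
	return (int32_t) (a - b) <= 0;
}

/*
 * Accept (xid, epoch) only if it is not in the future and not so old
 * that it has wrapped around relative to (next_xid, next_epoch).
 */
static bool
xid_in_recent_past(TransactionId xid, uint32_t epoch,
				   TransactionId next_xid, uint32_t next_epoch)
{
	if (xid <= next_xid)
	{
		/* same side of the wraparound boundary: epochs must match */
		if (epoch != next_epoch)
			return false;
	}
	else
	{
		/* numerically larger xid: must come from the previous epoch */
		if (epoch + 1 != next_epoch)
			return false;
	}

	/* epoch is plausible, but the xid may still be more than 2^31 back */
	return xid_precedes_or_equals(xid, next_xid);
}
```

In the patch a failed check on either xid causes the whole feedback message
to be ignored, so one stale value discards both.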
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 3f47b98..61c43ee 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1292,17 +1292,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1376,9 +1381,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.
+	 *
+	 * XXX TODO make sure we zero the checkpointed value when we turn logical decoding
+	 * off, and check it during startup!!
+	 */
+	if (!XLogLogicalInfoActive() && !force)
+	{
+		Assert(!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin));
+		Assert(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin));
+	}
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2656,6 +2745,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+			
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2930,18 +3066,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 9cc1281..49f2082b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1108,3 +1111,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock;
+			 * that also lets the exiting backend clear the slot's active_pid.
+			 * No new conflicts can appear "behind" our scan of the
+			 * replication_slots array because sessions check the
+			 * oldestCatalogXmin on decoding startup.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/*
+				 * As a safeguard against signalling the wrong process in case of
+				 * pid reassignment, check that the slot's active_pid hasn't been
+				 * cleared or changed. An unlocked read is fine here: the worst
+				 * outcome, even for a garbage read, is an extra sleep. If a
+				 * new backend gets the same pid in the same slot array position
+				 * you have terrible luck, and it might get cancelled with a
+				 * spurious conflict.
+				 */
+				if (active_pid != slot->active_pid)
+					continue;
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("PID %d requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   (int) active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c15303c..e8914cf 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2270,6 +2270,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2694,8 +2697,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2777,6 +2784,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2791,12 +2799,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2851,11 +2860,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3bad417 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -242,6 +242,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 522c104..441edbe 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 4df6529..8165e19 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 23731e9..6415df3 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -43,6 +43,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..dbde85f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -747,7 +747,8 @@ typedef enum
 	WAIT_EVENT_WAL_SENDER_MAIN,
 	WAIT_EVENT_WAL_WRITER_MAIN,
 	WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-	WAIT_EVENT_LOGICAL_APPLY_MAIN
+	WAIT_EVENT_LOGICAL_APPLY_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 62cacdb..9a2dbd7 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0857bdc..c8ee94c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 0d5027f..4e3bc70 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -78,6 +78,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -86,6 +88,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(bool force);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..74713f9 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..3f57230
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,454 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+hot_standby_feedback = on
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--xlog-method=stream', '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_xmin threshold when we
+# start decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery, but only
+# if there's an active xact when decoding. Logical decoding
+# doesn't keep a virtualxid while waiting for WAL, only when calling output
+# plugins, so this won't work.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#32Michael Paquier
michael.paquier@gmail.com
In reply to: Craig Ringer (#31)
Re: Logical decoding on standby

On Tue, Jan 24, 2017 at 7:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

Rebased series attached, on top of current master (which includes
logical replication).

I'm inclined to think I should split out a few of the changes from
0005, roughly along the lines of the bullet points in its commit
message. Anyone feel strongly about how granular this should be?

This patch series is a pre-requisite for supporting logical
replication using a physical standby as a source, but does not
itself enable logical replication from a physical standby.

There are patches but no reviews yet so moved to CF 2017-03.
--
Michael


#33Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#31)
Re: Logical decoding on standby

On 24 January 2017 at 06:37, Craig Ringer <craig@2ndquadrant.com> wrote:

Rebased series attached, on top of current master (which includes
logical replication).

I'm inclined to think I should split out a few of the changes from
0005, roughly along the lines of the bullet points in its commit
message. Anyone feel strongly about how granular this should be?

This patch series is a pre-requisite for supporting logical
replication using a physical standby as a source, but does not
itself enable logical replication from a physical standby.

Patch 4 committed. Few others need rebase.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#34Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#33)
4 attachment(s)
Re: Logical decoding on standby

On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

Patch 4 committed. Few others need rebase.

Since this patch series and the initial data copy for logical replication
both add a facility for suppressing initial snapshot export on a
logical slot, I've dropped patch 0003 (make snapshot export on logical
slot creation optional) in favour of Petr's similar patch.

I will duplicate it in this patch series for ease of application. (The
version here is slightly extended over Petr's so I'll re-post the
modified version on the logical replication initial data copy thread
too).
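
As a rough illustration of the option discussed above (a sketch, not code
from the patch series): a client that does not need the exported snapshot
could append NOEXPORT_SNAPSHOT when it builds the CREATE_REPLICATION_SLOT
command. The helper name build_create_slot_cmd is hypothetical, and the
sketch assumes the slot and plugin names need no escaping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a logical CREATE_REPLICATION_SLOT command
 * into buf. Returns the command length, or -1 if buf is too small.
 * Assumption: slot_name and plugin contain no characters that need quoting.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen,
					  const char *slot_name, const char *plugin,
					  bool export_snapshot)
{
	int			n;

	assert(buf != NULL && slot_name != NULL && plugin != NULL);

	n = snprintf(buf, buflen,
				 "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\" %s",
				 slot_name, plugin,
				 export_snapshot ? "EXPORT_SNAPSHOT" : "NOEXPORT_SNAPSHOT");

	/* snprintf reports truncation by returning the untruncated length */
	return (n >= 0 && (size_t) n < buflen) ? n : -1;
}
```

A client talking to a server that predates the option would have to omit
the keyword entirely, which is why a server version check is needed before
appending it.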

The main thing I want to direct attention to for Simon, as committer,
is the xlog'ing of VACUUM's xid threshold before we advance it and
start removing tuples. This is necessary for the standby to know
whether a given replication slot is safe to use and fail with conflict
with recovery if it is not, or if it becomes unsafe due to master
vacuum activity. Note that we can _not_ use the various vacuum records
for this because we don't know which are catalogs and which aren't;
we'd have to add a separate "is catalog" field to each vacuum xlog
record, which is undesirable. The downstream can't look up whether
it's a catalog or not because it doesn't have relcache/syscache access
during decoding.
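
To make the check concrete, here is a minimal C sketch, not taken from the
patch. It assumes the master logs a single threshold (the oldest xid whose
catalog tuples are still retained) and that the standby compares each
logical slot's catalog_xmin against it. The names xid_precedes and
slot_conflicts_with_threshold are hypothetical; the comparison follows the
wraparound-aware modulo-2^32 style PostgreSQL uses for normal xids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Wraparound-aware "a is older than b", valid for normal xids only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	assert(a >= FirstNormalTransactionId && b >= FirstNormalTransactionId);
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical standby-side check: a slot is unsafe if it still needs
 * catalog tuples older than the threshold the master has logged.
 * Assumption: logged_threshold is a normal xid; a slot with no
 * catalog_xmin needs no catalog tuples and never conflicts.
 */
static bool
slot_conflicts_with_threshold(TransactionId slot_catalog_xmin,
							  TransactionId logged_threshold)
{
	if (slot_catalog_xmin == InvalidTransactionId)
		return false;

	return xid_precedes(slot_catalog_xmin, logged_threshold);
}
```

A single logged threshold keeps the per-record vacuum WAL unchanged, which
is the trade-off argued for above: the standby needs one comparison per
slot rather than an "is catalog" flag on every vacuum record.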

This change might look a bit similar to the vac_truncate_clog change
in the txid_status patch, but it isn't really related. The txid_status
change is about knowing when we can safely look up xids in clog and
preventing a race with clog truncation. This change is about knowing
whether all catalog tuples for a given xid are still in the heap and
have not been vacuumed away. Both are about making sure standbys know
more about the state of the system in a low-cost way, though.

WaitForMasterCatalogXminReservation(...) in logical.c is also worth
looking more closely at.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-option-to-control-snapshot-export-to-CREATE_REPL.patch (text/x-patch; charset=US-ASCII)
From 02af255a84736ac2783705055d6e998e476359af Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Thu, 9 Mar 2017 14:20:28 +0100
Subject: [PATCH 1/4] Add option to control snapshot export to
 CREATE_REPLICATION_SLOT

We used to export snapshots unconditionally in CREATE_REPLICATION_SLOT
in the replication protocol, but several upcoming patches want more control
over what happens with the slot.

This means that when we allow creation of replication slots on standbys, which
cannot export snapshots because they cannot allocate new XIDs, we don't have to
silently omit the snapshot creation.

It also allows clients like pg_recvlogical, which neither need nor can use the
exported snapshot, to suppress its creation. Since snapshot exporting can fail
this improves reliability.
---
 doc/src/sgml/logicaldecoding.sgml                  | 13 +++--
 doc/src/sgml/protocol.sgml                         | 16 +++++-
 src/backend/commands/subscriptioncmds.c            |  6 ++-
 .../libpqwalreceiver/libpqwalreceiver.c            | 15 ++++--
 src/backend/replication/repl_gram.y                | 43 ++++++++++++----
 src/backend/replication/repl_scanner.l             |  2 +
 src/backend/replication/walsender.c                | 58 ++++++++++++++++++++--
 src/bin/pg_basebackup/streamutil.c                 |  5 ++
 src/include/nodes/replnodes.h                      |  2 +-
 src/include/replication/walreceiver.h              |  6 +--
 10 files changed, 140 insertions(+), 26 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 03c2c69..2b7d6e9 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -268,11 +268,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-snapshot-export" xreflabel="Exported Snapshots (Logical Decoding)">
     <title>Exported Snapshots</title>
     <para>
-     When a new replication slot is created using the streaming replication interface,
-     a snapshot is exported
+     When <link linkend="protocol-replication-create-slot">a new replication
+     slot is created using the streaming replication interface</>, a snapshot
+     is exported
      (see <xref linkend="functions-snapshot-synchronization">), which will show
      exactly the state of the database after which all changes will be
      included in the change stream. This can be used to create a new replica by
@@ -282,6 +283,12 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      database's state at that point in time, which afterwards can be updated
      using the slot's contents without losing any changes.
     </para>
+    <para>
+     Creation of a snapshot is not always possible - in particular, it will
+     fail when connected to a hot standby. Applications that do not require
+     snapshot export may suppress it with the <literal>NOEXPORT_SNAPSHOT</>
+     option.
+    </para>
    </sect2>
   </sect1>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 3d6e8ee..95603d3 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1487,7 +1487,7 @@ The commands accepted in walsender mode are:
   </varlistentry>
 
   <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> [ <literal>TEMPORARY</> ] { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> [ <literal>EXPORT_SNAPSHOT</> | <literal>NOEXPORT_SNAPSHOT</> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1538,6 +1538,20 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+      <varlistentry>
+       <term><literal>EXPORT_SNAPSHOT</></term>
+       <term><literal>NOEXPORT_SNAPSHOT</></term>
+       <listitem>
+        <para>
+         Decides what to do with the snapshot created during logical slot
+         initialization. <literal>EXPORT_SNAPSHOT</>, which is the
+         default, will export the snapshot for use in other sessions. This
+         option can't be used inside a transaction. The
+         <literal>NOEXPORT_SNAPSHOT</> option will just use the snapshot for
+         logical decoding as normal but won't do anything else with it.
+        </para>
+       </listitem>
+      </varlistentry>
      </variablelist>
 
      <para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 0036d99..33ccc08 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,7 +314,11 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 
 		PG_TRY();
 		{
-			walrcv_create_slot(wrconn, slotname, false, &lsn);
+			/*
+			 * Create permanent slot for the subscription, we won't use
+			 * the initial snapshot for anything so no need to export it.
+			 */
+			walrcv_create_slot(wrconn, slotname, false, false, &lsn);
 			ereport(NOTICE,
 					(errmsg("created replication slot \"%s\" on publisher",
 							slotname)));
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index ebadf36..cd2e578 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -68,6 +68,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool export_snapshot,
 								  XLogRecPtr *lsn);
 static bool libpqrcv_command(WalReceiverConn *conn,
 							 const char *cmd, char **err);
@@ -720,7 +721,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, XLogRecPtr *lsn)
+					 bool temporary, bool export_snapshot, XLogRecPtr *lsn)
 {
 	PGresult	   *res;
 	StringInfoData	cmd;
@@ -728,13 +729,19 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 
 	initStringInfo(&cmd);
 
-	appendStringInfo(&cmd, "CREATE_REPLICATION_SLOT \"%s\" ", slotname);
+	appendStringInfo(&cmd, "CREATE_REPLICATION_SLOT \"%s\"", slotname);
 
 	if (temporary)
-		appendStringInfo(&cmd, "TEMPORARY ");
+		appendStringInfo(&cmd, " TEMPORARY");
 
 	if (conn->logical)
-		appendStringInfo(&cmd, "LOGICAL pgoutput");
+	{
+		appendStringInfo(&cmd, " LOGICAL pgoutput");
+		if (export_snapshot)
+			appendStringInfo(&cmd, " EXPORT_SNAPSHOT");
+		else
+			appendStringInfo(&cmd, " NOEXPORT_SNAPSHOT");
+	}
 
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index b35d0f0..f1e43bc 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -79,6 +79,8 @@ Node *replication_parse_result;
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_EXPORT_SNAPSHOT
+%token K_NOEXPORT_SNAPSHOT
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -91,7 +93,9 @@ Node *replication_parse_result;
 %type <defelt>	plugin_opt_elem
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
-%type <boolval>	opt_reserve_wal opt_temporary
+%type <boolval>	opt_temporary
+%type <list>	create_slot_opt_list
+%type <defelt>	create_slot_opt
 
 %%
 
@@ -202,18 +206,18 @@ base_backup_opt:
 
 create_replication_slot:
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY PHYSICAL RESERVE_WAL */
-			K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_PHYSICAL opt_reserve_wal
+			K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_PHYSICAL create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_PHYSICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->reserve_wal = $5;
+					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
@@ -221,10 +225,36 @@ create_replication_slot:
 					cmd->slotname = $2;
 					cmd->temporary = $3;
 					cmd->plugin = $5;
+					cmd->options = $6;
 					$$ = (Node *) cmd;
 				}
 			;
 
+create_slot_opt_list:
+			create_slot_opt_list create_slot_opt
+				{ $$ = lappend($1, $2); }
+			| /* EMPTY */
+				{ $$ = NIL; }
+			;
+
+create_slot_opt:
+			K_EXPORT_SNAPSHOT
+				{
+				  $$ = makeDefElem("export_snapshot",
+								   (Node *)makeInteger(TRUE), -1);
+				}
+			| K_NOEXPORT_SNAPSHOT
+				{
+				  $$ = makeDefElem("export_snapshot",
+								   (Node *)makeInteger(FALSE), -1);
+				}
+			| K_RESERVE_WAL
+				{
+				  $$ = makeDefElem("reserve_wal",
+								   (Node *)makeInteger(TRUE), -1);
+				}
+			;
+
 /* DROP_REPLICATION_SLOT slot */
 drop_replication_slot:
 			K_DROP_REPLICATION_SLOT IDENT
@@ -291,11 +321,6 @@ opt_physical:
 			| /* EMPTY */
 			;
 
-opt_reserve_wal:
-			K_RESERVE_WAL					{ $$ = true; }
-			| /* EMPTY */					{ $$ = false; }
-			;
-
 opt_temporary:
 			K_TEMPORARY						{ $$ = true; }
 			| /* EMPTY */					{ $$ = false; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 37f8579..f56d41d 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -100,6 +100,8 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
+NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 
 ","				{ return ','; }
 ";"				{ return ';'; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dd3a936..127efec 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -51,6 +51,7 @@
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
+#include "commands/defrem.h"
 #include "funcapi.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -738,6 +739,48 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 }
 
 /*
+ * Process extra options given to CREATE_REPLICATION_SLOT.
+ */
+static void
+parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
+						   bool *reserve_wal,
+						   bool *export_snapshot)
+{
+	ListCell   *lc;
+	bool		snapshot_action_given = false;
+	bool		reserve_wal_given = false;
+
+	/* Parse options */
+	foreach (lc, cmd->options)
+	{
+		DefElem    *defel = (DefElem *) lfirst(lc);
+
+		if (strcmp(defel->defname, "export_snapshot") == 0)
+		{
+			if (snapshot_action_given || cmd->kind != REPLICATION_KIND_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			snapshot_action_given = true;
+			*export_snapshot = defGetBoolean(defel);
+		}
+		else if (strcmp(defel->defname, "reserve_wal") == 0)
+		{
+			if (reserve_wal_given || cmd->kind != REPLICATION_KIND_PHYSICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			reserve_wal_given = true;
+			*reserve_wal = true;
+		}
+		else
+			elog(ERROR, "unrecognized option: %s", defel->defname);
+	}
+}
+
+/*
  * Create a new replication slot.
  */
 static void
@@ -746,6 +789,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	const char *snapshot_name = NULL;
 	char		xpos[MAXFNAMELEN];
 	char	   *slot_name;
+	bool		reserve_wal = false;
+	bool		export_snapshot = true;
 	DestReceiver *dest;
 	TupOutputState *tstate;
 	TupleDesc	tupdesc;
@@ -754,6 +799,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	Assert(!MyReplicationSlot);
 
+	parseCreateReplSlotOptions(cmd, &reserve_wal, &export_snapshot);
+
 	/* setup state for XLogReadPage */
 	sendTimeLineIsHistoric = false;
 	sendTimeLine = ThisTimeLineID;
@@ -799,10 +846,13 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		DecodingContextFindStartpoint(ctx);
 
 		/*
-		 * Export a plain (not of the snapbuild.c type) snapshot to the user
-		 * that can be imported into another session.
+		 * Export the snapshot if we've been asked to do so.
+		 *
+		 * NB. We will convert the snapbuild.c kind of snapshot to normal
+		 * snapshot when doing this.
 		 */
-		snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
+		if (export_snapshot)
+			snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder);
 
 		/* don't need the decoding context anymore */
 		FreeDecodingContext(ctx);
@@ -810,7 +860,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		if (!cmd->temporary)
 			ReplicationSlotPersist();
 	}
-	else if (cmd->kind == REPLICATION_KIND_PHYSICAL && cmd->reserve_wal)
+	else if (cmd->kind == REPLICATION_KIND_PHYSICAL && reserve_wal)
 	{
 		ReplicationSlotReserveWal();
 
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 1fe42ef..507da5e 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -338,8 +338,13 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
 						  slot_name);
 	else
+	{
 		appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"",
 						  slot_name, plugin);
+		if (PQserverVersion(conn) >= 100000)
+			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
+			appendPQExpBuffer(query, " NOEXPORT_SNAPSHOT");
+	}
 
 	res = PQexec(conn, query->data);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index f27354f..996da3c 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,7 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
-	bool		reserve_wal;
+	List	   *options;
 } CreateReplicationSlotCmd;
 
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0857bdc..78e577c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -183,7 +183,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn, const char *buffer,
 								int nbytes);
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname, bool temporary,
-										XLogRecPtr *lsn);
+										bool export_snapshot, XLogRecPtr *lsn);
 typedef bool (*walrcv_command_fn) (WalReceiverConn *conn, const char *cmd,
 								   char **err);
 typedef void (*walrcv_disconnect_fn) (WalReceiverConn *conn);
@@ -224,8 +224,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, export_snapshot, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, export_snapshot, lsn)
 #define walrcv_command(conn, cmd, err) \
 	WalReceiverFunctions->walrcv_command(conn, cmd, err)
 #define walrcv_disconnect(conn) \
-- 
2.5.5

0002-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From b432672bad487ea4c6ac9141298185b9c4ef892e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 2/4] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 78 +++++++++++++++++++++++++++++
 src/test/recovery/t/006_logical_decoding.pl | 31 +++++++++++-
 2 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e5cb348..6e66e5c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1520,6 +1520,84 @@ sub slot
 
 =pod
 
+=item $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	die 'slot name must be specified' unless defined($slot_name);
+	die 'endpos must be specified' unless defined($endpos);
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos;
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 1716360..3f249cd 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -35,5 +39,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5
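
The test above fetches an LSN as text and passes it to pg_recvlogical_upto as the end position. As a rough illustration of what comparing against such an end position involves, here is a minimal C sketch. It is not taken from the patch. It parses the textual %X/%X form into a 64-bit value and checks whether a received position has reached it. The names parse_lsn and reached_endpos are hypothetical, and the "stop at or past endpos" rule is an assumption about the intended behaviour, not a description of the actual pg_recvlogical code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse an LSN written as two hex halves, "hi/lo"
 * (e.g. "0/16B3748"), into one 64-bit position. Returns 1 on success and
 * 0 if the text is malformed or has trailing characters.
 */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	char		trailing;

	assert(text != NULL && lsn != NULL);

	/* Exactly two conversions must succeed; a third means trailing junk. */
	if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
		return 0;

	*lsn = ((uint64_t) hi << 32) | lo;
	return 1;
}

/*
 * Hypothetical stop rule (an assumption): an end position of 0 means
 * "no end position", otherwise stop once the current position is at or
 * past it.
 */
static int
reached_endpos(uint64_t cur, uint64_t endpos)
{
	return endpos != 0 && cur >= endpos;
}
```

Because the parsed value is a plain integer, "the last change" can be found by numeric comparison, which is what the ORDER BY location DESC LIMIT 1 query in the test relies on.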

0003-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From f3e6d09ba60fd31e9f63d35905bcd772e60a7f9d Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 3/4] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a prerequisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 8b99b78..c9efff4 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -661,6 +662,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -676,7 +678,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -703,6 +706,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -751,6 +755,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read the whole segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -771,28 +898,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're a logical decoding session on C and B gets promoted, our
+		 * timeline will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * the historical one.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 41c5000..0dfcdac 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -235,13 +235,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -280,6 +280,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 127efec..0ecf7b0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -48,6 +48,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -719,6 +720,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -972,10 +979,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 663d3e7..12fa274 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -161,6 +161,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 *
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 567a7f3..25a9942 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..142a1b8 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before_basebackup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5
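
The segment-based timeline choice in XLogReadDetermineTimeline can be hard to picture from the diff alone, so here is a small self-contained C sketch of the idea. It is a toy model, not the patch's code: LsnT, TliT, TimelineEntry, TOY_SEG_SIZE, TOY_BLCKSZ and the three functions are hypothetical stand-ins, and the 16MB segment and 8kB page sizes are assumed values. It shows the key rule: a page is read from the timeline that owns the last byte of its segment, so every page in the segment containing a switch is read from the newer timeline, including pages before the switch point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t LsnT;		/* toy stand-in for XLogRecPtr */
typedef uint32_t TliT;		/* toy stand-in for TimeLineID */

#define TOY_SEG_SIZE	((LsnT) 16 * 1024 * 1024)	/* assumed segment size */
#define TOY_BLCKSZ		((LsnT) 8192)				/* assumed xlog page size */

/* One toy history entry: tli covers [begin, end); end == 0 means "current". */
typedef struct
{
	TliT		tli;
	LsnT		begin;
	LsnT		end;
} TimelineEntry;

/* Last byte position of the segment that contains want_page. */
static LsnT
end_of_segment(LsnT want_page)
{
	return ((want_page / TOY_SEG_SIZE) + 1) * TOY_SEG_SIZE - 1;
}

/* Timeline that owns ptr according to the toy history, or 0 if none does. */
static TliT
tli_of_point(LsnT ptr, const TimelineEntry *hist, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (ptr >= hist[i].begin && (hist[i].end == 0 || ptr < hist[i].end))
			return hist[i].tli;
	}
	return 0;
}

/*
 * Pick the timeline to read a page from: the newest timeline present on
 * the page's segment, i.e. the owner of the segment's last byte.
 */
static TliT
timeline_for_page(LsnT want_page, const TimelineEntry *hist, size_t n)
{
	assert(want_page % TOY_BLCKSZ == 0);
	return tli_of_point(end_of_segment(want_page), hist, n);
}
```

For example, with timeline 1 ending and timeline 2 starting five pages into the fourth segment, a page at the very start of that segment is still read from timeline 2, while any page in the third segment is read from timeline 1. This mirrors the reasoning in the patch that the old timeline's copy of the switch segment may be missing or renamed with a .partial suffix.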

0004-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 59fa05e06d7cb3fc31074ddb0e938515a1130eb2 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 4/4] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, and make walsender
  exit with a recovery conflict on upstream DROP DATABASE when it has an active
  logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin and to be called when already locked.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in
  hot_standby_feedback.
* Separate the catalog_xmin used by vacuum from ProcArray's
  replication_slot_catalog_xmin, requiring that xlog be emitted before vacuum
  can remove no-longer-needed catalog tuples; store it in checkpoints, and make
  vacuum and bgwriter advance it.
* During decoding startup, check whether the catalog_xmin requirement can be
  satisfied and bail out if it cannot.
* Add a new recovery conflict type for conflict with catalog_xmin. Abort
  in-progress logical decoding sessions with a recovery conflict where the
  needed catalog_xmin is too old.
* Make extra efforts to reserve the master's catalog_xmin during decoding
  startup on standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove checks preventing starting logical decoding on standby.
---
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  14 +
 src/backend/access/transam/xact.c                  |  55 +++
 src/backend/access/transam/xlog.c                  |  26 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |   9 +
 src/backend/postmaster/pgstat.c                    |   3 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 318 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 135 ++++--
 src/backend/storage/ipc/procarray.c                | 201 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 454 +++++++++++++++++++++
 35 files changed, 1543 insertions(+), 129 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5580637..34898f6 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 8db1e20..743cbee 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 95603d3..2343298 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1910,10 +1910,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1923,7 +1924,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index af25836..efbaaf0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7312,7 +7312,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c7b283c..30cf0f4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -810,7 +810,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 42fc351..b6bee35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,20 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 82f9a3c..da6952f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5643,6 +5643,61 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 744360c..0649de8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4983,6 +4983,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -4995,6 +4996,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6581,6 +6583,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6597,6 +6602,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8666,6 +8672,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8869,7 +8876,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9232,7 +9239,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9423,6 +9430,16 @@ XLogReportParameters(void)
 			XLogFlush(recptr);
 		}
 
+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);
+
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
@@ -9591,6 +9608,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9689,8 +9707,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8d42a34..7ce7c8f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2270,7 +2270,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b91df98..0f166a0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1000,7 +1000,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5a63b1a..052957b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2124,11 +2124,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a9b965..5f32bcd 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -527,7 +536,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -939,7 +948,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..47fd265 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin(false);
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2fb9a8b..58e58c5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3309,6 +3309,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_LOGICAL_APPLY_MAIN:
 			event_name = "LogicalApplyMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..07a120d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with recovery error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..e5f812f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,239 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master doesn't explicitly reply to hot standby feedback or
+	 * identify which message is the most recent, nor does it report
+	 * the catalog_xmin reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with recovery at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * What we'd really like is to get a reply from the server explicitly
+	 * confirming that it has applied our hs_feedback and what the lowest
+	 * catalog_xmin it can honour is. This turns out to be tricky to do
+	 * through a cascade, so for now we'll tolerate slow slot creation
+	 * and a small race risk.
+	 */
+
+	firstWaitWalEnd = WalRcv->latestWalEnd;
+	lastWaitWalEnd = firstWaitWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our oldestCatalogXmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has
+		 * already removed it. We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
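
(Aside, not part of the patch.) The validity test in EnsureActiveLogicalSlotValid() above depends on wraparound-aware xid comparison. The sketch below is a standalone illustration only: xid_t, xid_precedes() and slot_catalogs_still_available() are hypothetical names invented for this example, not PostgreSQL APIs, and the modulo-2^32 comparison is assumed to mirror what TransactionIdPrecedes() does for normal xids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;				/* hypothetical stand-in for TransactionId */

#define INVALID_XID			((xid_t) 0)
#define FIRST_NORMAL_XID	((xid_t) 3)

static bool
xid_is_normal(xid_t xid)
{
	return xid >= FIRST_NORMAL_XID;
}

/*
 * Wraparound-aware "a is older than b". Special (non-normal) xids are
 * compared as plain unsigned values; normal xids are compared modulo 2^32.
 */
static bool
xid_precedes(xid_t a, xid_t b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * A slot is usable only if a global catalog_xmin limit is set and that
 * limit has not advanced past the slot's own catalog_xmin.
 */
static bool
slot_catalogs_still_available(xid_t oldest_catalog_xmin,
							  xid_t slot_catalog_xmin)
{
	assert(xid_is_normal(slot_catalog_xmin));

	if (oldest_catalog_xmin == INVALID_XID)
		return false;			/* nothing is being reserved at all */

	return !xid_precedes(slot_catalog_xmin, oldest_catalog_xmin);
}
```

With a global limit of 150 and a slot catalog_xmin of 100 this returns false, which corresponds to the "requires catalogs removed by master" error above; with the two values swapped, or equal, the slot remains usable.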
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 10d69d0..f4d4e39 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -795,6 +795,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -842,7 +929,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 18d9d7e..2236c5d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -508,9 +508,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If wal_level is logical, we also send immediate hot
+						 * standby feedback so as to reduce the delay before our
+						 * needed catalogs are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1175,8 +1181,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1227,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0ecf7b0..74bd405 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -192,7 +192,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -221,6 +220,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1541,6 +1541,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1603,7 +1608,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1624,6 +1629,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1634,59 +1655,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1711,15 +1765,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2584,17 +2646,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2628,7 +2679,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cd14667..85dce17 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1292,17 +1292,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1376,9 +1381,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.
+	 *
+	 * XXX TODO make sure we zero the checkpointed value when we turn logical decoding
+	 * off, and check it during startup!!
+	 */
+	if (!XLogLogicalInfoActive() && !force)
+	{
+		Assert(!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin));
+		Assert(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin));
+	}
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on a replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2656,6 +2745,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2964,18 +3100,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..2695fa2 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1110,3 +1113,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock. No
+			 * new conflicts can appear "behind" our scan of the replication_slots
+			 * array because sessions check the oldestCatalogXmin on decoding
+			 * startup. This lets the exiting backend clear the slot's
+			 * active_pid.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/*
+				 * As a safeguard against signalling the wrong process in case of
+				 * pid reassignment, check that the slot's active_pid hasn't been
+				 * cleared or changed. Do an unlocked read here since the worst
+				 * wrong outcome even in the case of garbage read is an extra
+				 * sleep. If you get a new backend with the same pid in the
+				 * same slot array position you have terrible luck, and it
+				 * might get cancelled with a spurious conflict.
+				 */
+				if (active_pid != slot->active_pid)
+					continue;
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("PID %d requires catalog_xmin %u for replication slot \"%s\" but the master has removed catalogs up to xid %u.",
+								   (int) active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b07d6c6..8f69cfe 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2270,6 +2270,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2692,8 +2695,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2775,6 +2782,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2849,11 +2858,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 522c104..441edbe 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..d40bd4c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..ef33014 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0062fb8..c7a341d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -747,7 +747,8 @@ typedef enum
 	WAIT_EVENT_WAL_SENDER_MAIN,
 	WAIT_EVENT_WAL_WRITER_MAIN,
 	WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-	WAIT_EVENT_LOGICAL_APPLY_MAIN
+	WAIT_EVENT_LOGICAL_APPLY_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 62cacdb..9a2dbd7 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 78e577c..74ae4bf 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9d5a13e..aa35cf7 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -79,6 +79,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -87,6 +89,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(bool force);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..74713f9 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..ce8a6af
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,454 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+hot_standby_feedback = on
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery, but only
+# if there's an active xact when decoding. Logical decoding
+# doesn't keep a virtualxid while waiting for WAL, only when calling output
+# plugins, so this won't work.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#35Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#34)
Re: Logical decoding on standby

On 13 March 2017 at 10:56, Craig Ringer <craig@2ndquadrant.com> wrote:

On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

Patch 4 committed. Few others need rebase.

Since this patch series and the initial data copy patch for logical
replication both add a facility for suppressing initial snapshot export
on a logical slot, I've dropped patch 0003 (make snapshot export on
logical slot creation optional) in favour of Petr's similar patch.

I will duplicate it in this patch series for ease of application. (The
version here is slightly extended over Petr's so I'll re-post the
modified version on the logical replication initial data copy thread
too).

The main thing I want to direct attention to for Simon, as committer,
is the xlog'ing of VACUUM's xid threshold before we advance it and
start removing tuples. The standby needs this to know whether a given
replication slot is safe to use, and to fail with a recovery conflict
if it is not or if it later becomes unsafe due to master vacuum
activity. Note that we can _not_ use the various vacuum records
for this because we don't know which are catalogs and which aren't;
we'd have to add a separate "is catalog" field to each vacuum xlog
record, which is undesirable. The downstream can't look up whether
it's a catalog or not because it doesn't have relcache/syscache access
during decoding.
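
As a rough illustration of the check this enables (a sketch only, not
code from the patch): once the standby has replayed the logged
threshold, it can compare each slot's catalog_xmin against it using
wraparound-aware xid arithmetic. The helper names below are
hypothetical, and the handling of InvalidTransactionId is an assumption
made for the sketch rather than the patch's actual policy.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Wraparound-aware "a is older than b" for normal xids, using the
 * usual modulo-2^32 signed-difference comparison.
 */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: does a slot that needs catalog rows back to
 * slot_catalog_xmin conflict with a logged threshold saying the master
 * may have removed catalog rows older than logged_threshold?
 */
static bool
slot_conflicts_with_catalog_xmin(TransactionId slot_catalog_xmin,
								 TransactionId logged_threshold)
{
	/* Slot holds no catalog rows, so there is nothing to conflict with. */
	if (slot_catalog_xmin == InvalidTransactionId)
		return false;

	/*
	 * Assumption for this sketch: with no threshold to compare against
	 * we report no conflict. The real patch may treat this case
	 * differently.
	 */
	if (logged_threshold == InvalidTransactionId)
		return false;

	/* Conflict if the slot needs rows older than the threshold. */
	return xid_precedes(slot_catalog_xmin, logged_threshold);
}
```

A slot whose catalog_xmin equals the threshold is treated as safe here,
since only rows strictly older than the threshold are assumed removed.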

This change might look a bit similar to the vac_truncate_clog change
in the txid_status patch, but it isn't really related. The txid_status
change is about knowing when we can safely look up xids in clog and
preventing a race with clog truncation. This change is about knowing
when all catalog tuples for a given xid are guaranteed to still be in
the heap, not vacuumed away. Both are about making sure standbys know
more about the state of the system in a low-cost way, though.

WaitForMasterCatalogXminReservation(...) in logical.c is also worth
looking more closely at.

I should also note that because the TAP tests currently take a long
time, I recommend skipping the tests for this patch by default and
running them only when actually touching logical decoding.

I'm looking at ways to make them faster, but they're inevitably going
to take a while until we can get hot standby feedback replies in
place, including cascading support. I have that as WIP, but it won't
make this release.

Changing the test import to

use Test::More skip_all => "disabled by default, too slow";

will be sufficient.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#36Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#34)
Re: Logical decoding on standby

On 13 March 2017 at 10:56, Craig Ringer <craig@2ndquadrant.com> wrote:

On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

Patch 4 committed. Few others need rebase.

Since this patch series

Patch 1 fails since feature has already been applied. If other reason,
let me know.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#37Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#36)
3 attachment(s)
Re: Logical decoding on standby

On 19 March 2017 at 18:02, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

Patch 1 fails since feature has already been applied. If other reason,
let me know.

Nope, that's fine.

Rebased attached.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0002-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From 2fa891a555ea4fb200d75b8c906c6b932699b463 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 2/3] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.5 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.5
feature freeze when issues were discovered too late to safely fix them
in the 9.5 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 200 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   7 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 347 insertions(+), 22 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b2b9fcb..5a51be0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -662,6 +663,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -677,7 +679,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -704,6 +707,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -754,6 +758,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is being
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -774,28 +901,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're doing logical decoding on C, and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= read_upto)
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * it.
+			 */
+			read_upto = state->currTLIValidUntil;
+
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 41c5000..0dfcdac 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -235,13 +235,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
+	ReplicationSlotAcquire(NameStr(*name));
+
 	/* compute the current end-of-wal */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
-	ReplicationSlotAcquire(NameStr(*name));
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	PG_TRY();
 	{
@@ -280,6 +280,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0f6b828..a00c204 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -48,6 +48,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -721,6 +722,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -974,10 +981,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 663d3e7..12fa274 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -161,6 +161,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 * 
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 567a7f3..25a9942 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..142a1b8 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before base_backup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5

0003-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 3/3] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, make walsender
  exit with recovery conflict on upstream drop database when it has an active
  logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin, be called already locked.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in hot_standby_feedback
* Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
  requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store
  it in checkpoints, make vacuum and bgwriter advance it.
* During decoding startup check whether catalog_xmin requirement can be satisfied
  and bail out if it can not
* Add a new recovery conflict type for conflict with catalog_xmin. Abort
  in-progress logical decoding sessions with conflict with recovery where needed
  catalog_xmin is too old
* Make extra efforts to reserve master's catalog_xmin during decoding startup
  on standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove checks preventing starting logical decoding on standby
---
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  14 +
 src/backend/access/transam/xact.c                  |  55 +++
 src/backend/access/transam/xlog.c                  |  26 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |   9 +
 src/backend/postmaster/pgstat.c                    |   3 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 318 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 135 ++++--
 src/backend/storage/ipc/procarray.c                | 201 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 .../recovery/t/010_logical_decoding_on_replica.pl  | 454 +++++++++++++++++++++
 35 files changed, 1543 insertions(+), 129 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d0f7618..6261e68 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -557,7 +557,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -674,7 +674,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 8db1e20..743cbee 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 244e381..0cb6809 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1911,10 +1911,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1924,7 +1925,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled. New in 10.0.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby. New in 10.0.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8526137..07b8fa7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7328,7 +7328,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..36bbb98 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 42fc351..b6bee35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,20 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 02e0779..a9edf4a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5643,6 +5643,61 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9480377..580727b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5005,6 +5005,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -5017,6 +5018,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6607,6 +6609,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6623,6 +6628,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8692,6 +8698,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8895,7 +8902,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9258,7 +9265,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9449,6 +9456,16 @@ XLogReportParameters(void)
 			XLogFlush(recptr);
 		}
 
+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);
+
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
@@ -9617,6 +9634,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9715,8 +9733,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8d42a34..7ce7c8f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2270,7 +2270,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b91df98..0f166a0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1000,7 +1000,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5a63b1a..052957b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2124,11 +2124,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ff633fa..2d16bf0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -527,7 +536,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -939,7 +948,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..47fd265 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin(false);
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a50488..b06b7eb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3320,6 +3320,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_LOGICAL_APPLY_MAIN:
 			event_name = "LogicalApplyMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..07a120d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..e5f812f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,239 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master doesn't reply to hot standby feedback explicitly or
+	 * identify which message is the most recent, nor does it report
+	 * the catalog_xmin reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with standby at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * What we'd really like is a reply from the server explicitly
+	 * confirming that it has applied our hs_feedback and what the lowest
+	 * catalog_xmin it can honour is. This turns out to be tricky to do
+	 * through a cascade, so for now we'll tolerate slow slot creation
+	 * and a small race risk.
+	 */
+
+	firstWaitWalEnd = WalRcv->latestWalEnd;
+	lastWaitWalEnd = firstWaitWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our slot's catalog_xmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has already
+		 * removed it. We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5237a9f..5f51fa8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -796,6 +796,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -843,7 +930,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 18d9d7e..2236c5d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -508,9 +508,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If logical decoding information is enabled, we also
+						 * send immediate hot standby feedback so as to reduce
+						 * the delay before our needed catalogs are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1175,8 +1181,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1227,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+		
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
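
For anyone following the walreceiver change above, here is a rough standalone sketch of the extended hot standby feedback message: the epoch adjustment, then the 'h' message with the two extra fields. It is not code from the patch. epoch_for_xid(), build_hs_feedback() and HS_FEEDBACK_LEN are made-up names for illustration, plain uint32_t stands in for TransactionId, and it assumes the network byte order that pq_sendint() and pq_sendint64() produce.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 msgtype byte + 8 byte timestamp + four 4 byte fields (hypothetical name) */
#define HS_FEEDBACK_LEN 25

/*
 * Epoch to report for an xid, given the current nextXid and its epoch.
 * If nextXid is numerically below the xid, the counter has wrapped since
 * the xid was assigned, so the xid belongs to the previous epoch.
 */
static uint32_t
epoch_for_xid(uint32_t xid, uint32_t next_xid, uint32_t next_epoch)
{
	return (next_xid < xid) ? next_epoch - 1 : next_epoch;
}

/* Append a 32-bit value in network byte order; returns the new offset. */
static size_t
put_be32(unsigned char *buf, size_t off, uint32_t v)
{
	buf[off++] = (unsigned char) (v >> 24);
	buf[off++] = (unsigned char) (v >> 16);
	buf[off++] = (unsigned char) (v >> 8);
	buf[off++] = (unsigned char) v;
	return off;
}

/*
 * Lay out the feedback message in the order the patch sends it:
 * 'h', send time, xmin, xmin epoch, catalog_xmin, catalog_xmin epoch.
 * buf must have room for HS_FEEDBACK_LEN bytes. Returns bytes written.
 */
static size_t
build_hs_feedback(unsigned char *buf, int64_t send_time,
				  uint32_t xmin, uint32_t xmin_epoch,
				  uint32_t catalog_xmin, uint32_t catalog_xmin_epoch)
{
	size_t		off = 0;
	uint64_t	t = (uint64_t) send_time;

	buf[off++] = 'h';
	off = put_be32(buf, off, (uint32_t) (t >> 32));
	off = put_be32(buf, off, (uint32_t) t);
	off = put_be32(buf, off, xmin);
	off = put_be32(buf, off, xmin_epoch);
	off = put_be32(buf, off, catalog_xmin);
	off = put_be32(buf, off, catalog_xmin_epoch);
	return off;
}
```

The point of the layout is that the old fields keep their positions and the two catalog_xmin fields are simply appended, so the receiving side only has to read two more integers.
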
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a00c204..856c40b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -192,7 +192,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -221,6 +220,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1543,6 +1543,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1605,7 +1610,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1626,6 +1631,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1636,59 +1657,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1713,15 +1767,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2588,17 +2650,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2632,7 +2683,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
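
As a reading aid for the walsender change above, here is a small standalone sketch of the sanity check that TransactionIdInRecentPast() applies to each xmin/epoch pair from the feedback message. It is not part of the patch. The function takes nextXid and its epoch as parameters instead of calling GetNextXidAndEpoch(), uint32_t stands in for TransactionId, and xid_precedes_or_equals() is my own stand-in that assumes the usual modulo-2^32 comparison for normal xids.

```c
#include <stdbool.h>
#include <stdint.h>

/* Xids below this value are special (invalid, bootstrap, frozen). */
#define FIRST_NORMAL_XID 3u

/*
 * Circular "a <= b" for xids: special xids compare as plain unsigned
 * values, normal xids compare modulo 2^32.
 */
static bool
xid_precedes_or_equals(uint32_t a, uint32_t b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a <= b;
	return (int32_t) (a - b) <= 0;
}

/*
 * True if (xid, epoch) is not in the future and not so old that it has
 * already wrapped around, relative to (next_xid, next_epoch).
 */
static bool
xid_in_recent_past(uint32_t xid, uint32_t epoch,
				   uint32_t next_xid, uint32_t next_epoch)
{
	if (xid <= next_xid)
	{
		/* same side of the wraparound point: epochs must match */
		if (epoch != next_epoch)
			return false;
	}
	else
	{
		/* xid is numerically ahead: it must be from the previous epoch */
		if (epoch + 1 != next_epoch)
			return false;
	}

	/* epoch is plausible, but the xid may still be too far back */
	return xid_precedes_or_equals(xid, next_xid);
}
```

With the patch this check runs twice in ProcessStandbyHSFeedbackMessage(), once for xmin and once for catalog_xmin, and only for values that are normal xids.
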
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0f8f435..6e04604 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1292,17 +1292,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1376,9 +1381,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.
+	 *
+	 * XXX TODO make sure we zero the checkpointed value when we turn logical decoding
+	 * off, and check it during startup!!
+	 */
+	if (!XLogLogicalInfoActive() && !force)
+	{
+		Assert(!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin));
+		Assert(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin));
+	}
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2656,6 +2745,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+			
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2964,18 +3100,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		 *retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		 *needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
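
To make the procarray change above easier to follow, here is a standalone sketch of the test that decides whether the effective catalog_xmin used by vacuum has to be brought in line with what the slots need, as CatalogXminNeedsUpdate() does. It is only an illustration. uint32_t stands in for TransactionId, and xid_is_valid() and xid_precedes() are stand-ins I wrote on the assumption that they behave like TransactionIdIsValid() and TransactionIdPrecedes().

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_XID      0u
#define FIRST_NORMAL_XID 3u

static bool
xid_is_valid(uint32_t xid)
{
	return xid != INVALID_XID;
}

/* Circular "a < b"; special xids compare as plain unsigned values. */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * True if the value vacuum is using (vacuum_catalog_xmin) lags behind the
 * value the slots need (slots_catalog_xmin), or if exactly one of the two
 * is set, meaning a reservation was newly made or newly dropped.
 */
static bool
catalog_xmin_needs_update(uint32_t vacuum_catalog_xmin,
						  uint32_t slots_catalog_xmin)
{
	return xid_precedes(vacuum_catalog_xmin, slots_catalog_xmin) ||
		(xid_is_valid(vacuum_catalog_xmin) != xid_is_valid(slots_catalog_xmin));
}
```

UpdateOldestCatalogXmin() evaluates this twice: once without a lock as a cheap filter, and again under ProcArrayLock before actually storing the new value.
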
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..2695fa2 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1110,3 +1113,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock. No
+			 * new conflicts can appear "behind" our scan of the replication_slots
+			 * array because sessions check the oldestCatalogXmin on decoding
+			 * startup. Releasing the lock lets the exiting backend clear the
+			 * slot's active_pid.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/* 
+				 * As a safeguard against signalling the wrong process in case of
+				 * pid reassignment, check that the slot's active_pid hasn't been
+				 * cleared or changed. Do an unlocked read here since the worst
+				 * wrong outcome even in the case of garbage read is an extra
+				 * sleep. If you get a new backend with the same pid in the
+				 * same slot array position you have terrible luck, and it
+				 * might get cancelled with a spurious conflict. 
+				 */
+				if (active_pid != slot->active_pid)
+					continue;
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("Pid %d requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+			
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
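
The loop condition in ResolveRecoveryConflictWithLogicalDecoding() above is dense, so here is a standalone sketch of just the "does this slot conflict?" test. It is an illustration, not patch code. struct slot_view and slot_conflicts() are hypothetical names, uint32_t stands in for TransactionId, and next_xid is passed in instead of being read from ShmemVariableCache.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_XID      0u
#define FIRST_NORMAL_XID 3u

/* Hypothetical, cut-down view of the slot fields the check looks at. */
struct slot_view
{
	bool		in_use;
	int			active_pid;		/* 0 if no session holds the slot */
	uint32_t	effective_catalog_xmin;
};

/* Circular "a < b"; special xids compare as plain unsigned values. */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * A slot conflicts with a newly replayed catalog_xmin if it is actively
 * being decoded, holds a reservation, and that reservation is older than
 * the new threshold (or the master reports nothing reserved at all).
 * Reservations at or beyond next_xid are skipped: those slots are still
 * being created and looking for a start point.
 */
static bool
slot_conflicts(const struct slot_view *slot,
			   uint32_t new_catalog_xmin, uint32_t next_xid)
{
	return slot->in_use &&
		slot->active_pid != 0 &&
		slot->effective_catalog_xmin != INVALID_XID &&
		(new_catalog_xmin == INVALID_XID ||
		 xid_precedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
		xid_precedes(slot->effective_catalog_xmin, next_xid);
}
```

Inactive slots never match here. As the comment in the patch says, conflicts with those are detected later, when something tries to use them.
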
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b07d6c6..8f69cfe 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2270,6 +2270,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2692,8 +2695,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2775,6 +2782,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2849,11 +2858,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 522c104..441edbe 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..d40bd4c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..ef33014 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f2daf32..225f509 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -748,7 +748,8 @@ typedef enum
 	WAIT_EVENT_WAL_SENDER_MAIN,
 	WAIT_EVENT_WAL_WRITER_MAIN,
 	WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-	WAIT_EVENT_LOGICAL_APPLY_MAIN
+	WAIT_EVENT_LOGICAL_APPLY_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 62cacdb..9a2dbd7 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 78e577c..74ae4bf 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9d5a13e..aa35cf7 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -79,6 +79,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -87,6 +89,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(bool force);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..74713f9 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..ce8a6af
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,454 @@
+# Test logical decoding on a standby.
+#
+# Covers slot creation on the replica, catalog_xmin feedback to the master,
+# recovery conflicts, and slot cleanup when a database is dropped upstream.
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+hot_standby_feedback = on
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery, but only
+# if there's an active xact when decoding. Logical decoding doesn't keep a
+# virtualxid while waiting for WAL, only when calling output plugins, so a
+# conflict keyed on the virtualxid alone would not fire while idle here.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'], 'pg_recvlogical created dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped\./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5
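
For readers skimming the postgres.c hunks above: the patch treats a catalog_xmin conflict like a dropped database, i.e. as a non-retryable conflict that ends the session with FATAL, and it picks a different error code for it. The following is only a rough standalone model of that decision, not code from the patch. The enum and function names are made up for illustration, and the SQLSTATE strings are the values I understand ERRCODE_PROGRAM_LIMIT_EXCEEDED and ERRCODE_DATABASE_DROPPED to map to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the recovery-conflict members of
 * ProcSignalReason; the names are shortened and are not PostgreSQL's own.
 */
typedef enum
{
	CONFLICT_DATABASE,
	CONFLICT_TABLESPACE,
	CONFLICT_LOCK,
	CONFLICT_SNAPSHOT,
	CONFLICT_BUFFERPIN,
	CONFLICT_STARTUP_DEADLOCK,
	CONFLICT_CATALOG_XMIN,
	NUM_CONFLICTS				/* must be last */
} ConflictReason;

/*
 * Mirrors the rule in RecoveryConflictInterrupt(): every conflict is
 * retryable except a dropped database and, with this patch, catalog_xmin.
 */
bool
conflict_is_retryable(ConflictReason reason)
{
	assert((int) reason >= 0 && reason < NUM_CONFLICTS);
	return reason != CONFLICT_DATABASE && reason != CONFLICT_CATALOG_XMIN;
}

/*
 * Mirrors the error code choice in ProcessInterrupts() for the FATAL
 * report.  Returns NULL for retryable conflicts, which take another path.
 */
const char *
fatal_conflict_sqlstate(ConflictReason reason)
{
	switch (reason)
	{
		case CONFLICT_CATALOG_XMIN:
			return "54000";		/* program_limit_exceeded (assumed mapping) */
		case CONFLICT_DATABASE:
			return "57P04";		/* database_dropped (assumed mapping) */
		default:
			return NULL;
	}
}
```

A client that sees 57P04 or 54000 on a decoding connection to a standby should not blindly reconnect to the same slot; per the tests above, a slot whose catalog_xmin has been overtaken refuses new connections.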

Attachment: 0001-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch (text/x-patch; charset=US-ASCII)
From 4427fdf6e18445a4dfcfd98c9bd02125febe8023 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 15 Nov 2016 16:06:16 +0800
Subject: [PATCH 1/3] Add a pg_recvlogical wrapper to PostgresNode

---
 src/test/perl/PostgresNode.pm               | 78 +++++++++++++++++++++++++++++
 src/test/recovery/t/006_logical_decoding.pl | 31 +++++++++++-
 2 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 7e53067..1cfe3bc 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1505,6 +1505,84 @@ sub slot
 
 =pod
 
+=item $node->pg_recvlogical_upto(dbname, slot_name, endpos, timeout_secs, ...)
+
+Invoke pg_recvlogical to read from slot_name on dbname until LSN endpos, which
+corresponds to pg_recvlogical --endpos.  Gives up after timeout (if nonzero).
+
+Disallows pg_recvlogical from internally retrying on error by passing --no-loop.
+
+Plugin options are passed as additional keyword arguments.
+
+If called in scalar context, returns stdout, and die()s on timeout or nonzero return.
+
+If called in array context, returns a tuple of (retval, stdout, stderr, timeout).
+timeout is the IPC::Run::Timeout object whose is_expired method can be tested
+to check for timeout. retval is undef on timeout.
+
+=cut
+
+sub pg_recvlogical_upto
+{
+	my ($self, $dbname, $slot_name, $endpos, $timeout_secs, %plugin_options) = @_;
+	my ($stdout, $stderr);
+
+	my $timeout_exception = 'pg_recvlogical timed out';
+
+	die 'slot name must be specified' unless defined($slot_name);
+	die 'endpos must be specified' unless defined($endpos);
+
+	my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname', $self->connstr($dbname));
+	push @cmd, '--endpos', $endpos;
+	push @cmd, '-f', '-', '--no-loop', '--start';
+
+	while (my ($k, $v) = each %plugin_options)
+	{
+		die "= is not permitted to appear in replication option name" if ($k =~ qr/=/);
+		push @cmd, "-o", "$k=$v";
+	}
+
+	my $timeout;
+	$timeout = IPC::Run::timeout($timeout_secs, exception => $timeout_exception ) if $timeout_secs;
+	my $ret = 0;
+
+	do {
+		local $@;
+		eval {
+			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
+			$ret = $?;
+		};
+		my $exc_save = $@;
+		if ($exc_save)
+		{
+			# IPC::Run::run threw an exception. re-throw unless it's a
+			# timeout, which we'll handle by testing is_expired
+			die $exc_save
+			  if (blessed($exc_save) || $exc_save !~ qr/$timeout_exception/);
+
+			$ret = undef;
+
+			die "Got timeout exception '$exc_save' but timer not expired?!"
+			  unless $timeout->is_expired;
+
+			die "$exc_save waiting for endpos $endpos with stdout '$stdout', stderr '$stderr'"
+				unless wantarray;
+		}
+	};
+
+	if (wantarray)
+	{
+		return ($ret, $stdout, $stderr, $timeout);
+	}
+	else
+	{
+		die "pg_recvlogical exited with code '$ret', stdout '$stdout' and stderr '$stderr'" if $ret;
+		return $stdout;
+	}
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 1716360..3f249cd 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -1,9 +1,13 @@
 # Testing of logical decoding using SQL interface and/or pg_recvlogical
+#
+# Most logical decoding tests are in contrib/test_decoding. This module
+# is for work that doesn't fit well there, like where server restarts
+# are required.
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 5;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -35,5 +39,30 @@ $result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_chan
 chomp($result);
 is($result, '', 'Decoding after fast restart repeats no rows');
 
+# Insert some rows and verify that we get the same results from pg_recvlogical
+# and the SQL interface.
+$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT};
+
+my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
+is($stdout_sql, $expected, 'got expected output from SQL decoding session');
+
+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
+
+my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
+
+$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+chomp($stdout_recv);
+is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
+
 # done with the node
 $node_master->stop;
-- 
2.5.5

#38Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Craig Ringer (#37)
Re: Logical decoding on standby

Hi,

I don't know how well I can review the 0001 (the TAP infra patch) but it
looks okay to me.

I don't really have any complaints about 0002 either. I like that it's
more or less one self-contained function and that there are no weird
ifdefs anymore like in the 9.6 version (btw your commit message talks
about 9.5 but it was 9.6). I also like the clever test :)

I am slightly worried about impact of the readTimeLineHistory() call but
I think it should be called so little that it should not matter.
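
To illustrate why the history re-read should be rare, here is a simplified standalone model of the early-return checks in XLogReadDetermineTimeline() from the timeline following patch. It is only a sketch: the ReaderModel struct and the needs_history_lookup name are made up for illustration, the block and segment sizes are assumed to be the defaults (8kB and 16MB), and the real function works on XLogReaderState and the ThisTimeLineID global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types and constants. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLOG_BLCKSZ 8192u                     /* assumed default block size */
#define XLogSegSize (16u * 1024u * 1024u)     /* assumed default segment size */

/* Hypothetical subset of the reader state that the checks look at. */
typedef struct ReaderModel
{
    XLogRecPtr lastReadPage;      /* start of the page currently cached */
    uint32_t   readLen;           /* valid bytes in the cached page, 0 if none */
    TimeLineID currTLI;           /* 0 means "lookup required" */
    XLogRecPtr currTLIValidUntil; /* Invalid when on the current timeline */
} ReaderModel;

/*
 * Returns true only when none of the three fast paths applies, i.e. when
 * the timeline history would actually have to be re-read.
 */
bool
needs_history_lookup(const ReaderModel *st, XLogRecPtr wantPage,
                     uint32_t wantLength, TimeLineID thisTLI)
{
    uint32_t needed = wantLength < XLOG_BLCKSZ - 1 ? wantLength : XLOG_BLCKSZ - 1;

    assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
    assert(wantLength <= XLOG_BLCKSZ);

    /* 1. The wanted page is already read in and long enough. */
    if (st->lastReadPage == wantPage && st->readLen != 0 &&
        st->lastReadPage + st->readLen >= wantPage + needed)
        return false;

    /* 2. Reading forwards on the current timeline. */
    if (st->currTLI == thisTLI && wantPage >= st->lastReadPage)
        return false;

    /* 3. Still inside a validated historical timeline, before the segment
     *    that contains the switch point. */
    if (st->currTLIValidUntil != InvalidXLogRecPtr &&
        st->currTLI != thisTLI && st->currTLI != 0 &&
        (wantPage + wantLength) / XLogSegSize <
        st->currTLIValidUntil / XLogSegSize)
        return false;

    return true;
}
```

In this model a lookup is needed on the first read, on random access, and when reaching the segment that contains a timeline switch; sequential reads in between take one of the fast paths.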

That brings us to the big patch 0003.

I still don't like the "New in 10.0" comments in the documentation. For
one, it's 10, not 10.0, and more importantly we don't generally write
things like this in the documentation; that's what release notes are for.

There is a large amount of whitespace mess (empty lines with only
whitespace, spaces at the end of lines). Nothing horrible, but it
should be cleaned up.

One thing I don't really understand is the wal_level change and turning
off catalogXmin tracking. I don't see anywhere that the catalogXmin
would be reset in the control file, for example. There is a TODO in
UpdateOldestCatalogXmin() that seems related, but tbh I don't follow
what's happening there: the comment talks about not doing anything, but
the code inside the if block is only Asserts.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#37)
Re: Logical decoding on standby

On 19 March 2017 at 21:12, Craig Ringer <craig@2ndquadrant.com> wrote:

> Rebased attached.

Patch1 looks good to go. I'll correct a spelling mistake in the tap
test when I commit that later today.

Patch2 has a couple of points

2.1 Why does call to ReplicationSlotAcquire() move earlier in
pg_logical_slot_get_changes_guts()?

2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
documented well.
The setting
sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
should be
sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);

but that doesn't cause failure because in read_local_xlog_page() we
say that we are reading from history when
state->currTLI != ThisTimeLineID explicitly rather than use
sendTimeLineIsHistoric

So it looks like we could do with a few extra comments.
If you correct these I'll commit it tomorrow.
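
The corrected condition from point 2.2 can be sketched as a standalone function. This is only an illustration under simplifying assumptions: timeline_is_historic is a hypothetical helper that does not exist in the patch, and the arguments stand in for state->currTLI and the ThisTimeLineID global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's TimeLineID type. */
typedef uint32_t TimeLineID;

/*
 * Hypothetical helper, not part of the patch: the timeline being read is
 * historic exactly when it is NOT the server's current timeline.
 */
bool
timeline_is_historic(TimeLineID curr_tli, TimeLineID this_timeline_id)
{
    /* Timeline IDs start at 1, so 0 is never a valid current timeline. */
    assert(this_timeline_id != 0);

    return curr_tli != this_timeline_id;
}
```

With the original `==` comparison the flag would have the opposite meaning, reporting the current timeline as historic and every older one as current.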

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#40Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#39)
Re: Logical decoding on standby

On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> 2.1 Why does call to ReplicationSlotAcquire() move earlier in
> pg_logical_slot_get_changes_guts()?

That appears to be an oversight from an earlier version where it
looped over timelines in pg_logical_slot_get_changes_guts. Reverted.

> 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
> documented well.
> The setting
> sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
> should be
> sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);

Definitely wrong. Fixed.

> but that doesn't cause failure because in read_local_xlog_page() we
> say that we are reading from history when
> state->currTLI != ThisTimeLineID explicitly rather than use
> sendTimeLineIsHistoric

XLogRead(...), as called by logical_read_xlog_page, does test it. It's
part of the walsender-local log read callback. We don't hit
read_local_xlog_page at all when we're doing walsender based logical
decoding.

We have two parallel code paths for reading xlogs, one for walsender,
one for normal backends. The walsender one is glued together with a
bunch of globals that pass state "around" the xlogreader - we set it
up before calling into xlogreader, and then examine it when xlogreader
calls back into walsender.c with logical_read_xlog_page.
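
As a rough illustration of that pattern, here is a toy model of state being handed "around" a generic reader through file-level globals. Everything in it is a simplified stand-in: the ReaderModel struct, the model_* functions and currentTimeLine are made up for this sketch, and only the names sendTimeLine and sendTimeLineIsHistoric mirror the real walsender globals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* File-level state, modelled on the walsender globals named above. */
TimeLineID sendTimeLine = 0;
bool       sendTimeLineIsHistoric = false;

/* Stand-in for the server's current timeline (ThisTimeLineID). */
TimeLineID currentTimeLine = 2;

typedef struct ReaderModel ReaderModel;
typedef TimeLineID (*PageReadCallback) (ReaderModel *reader, XLogRecPtr page);

struct ReaderModel
{
    TimeLineID       currTLI;    /* timeline the reader chose to read from */
    PageReadCallback read_page;  /* supplied by the caller */
};

/* Hypothetical low-level read: it sees only the globals, not the reader. */
TimeLineID
model_xlog_read(XLogRecPtr page)
{
    (void) page;                 /* a real read would open a segment file */
    return sendTimeLine;
}

/* Callback: publish the reader's state into the globals, then read. */
TimeLineID
model_logical_read_page(ReaderModel *reader, XLogRecPtr page)
{
    sendTimeLine = reader->currTLI;
    sendTimeLineIsHistoric = (reader->currTLI != currentTimeLine);
    return model_xlog_read(page);
}

/* The generic reader knows nothing about the globals, only the callback. */
TimeLineID
model_reader_fetch(ReaderModel *reader, XLogRecPtr page)
{
    assert(reader->read_page != NULL);
    return reader->read_page(reader, page);
}
```

The point of the sketch is that the low-level read depends on whatever the callback last wrote into the globals, which is why a wrong value there only shows up on the walsender path.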

I really want to refactor that at some stage, getting rid of the use
of walsender globals for timeline state tracking and sharing more of
the xlog reading logic between walsender and normal backends. But
-ENOTIME, especially to do it as carefully as it must be done.

There are comments on read_local_xlog_page and logical_read_xlog_page
that mention this, and also on XLogRead in
src/backend/access/transam/xlogutils.c (which has the same name as
XLogRead in src/backend/replication/walsender.c). I have a draft of a
timeline following README that would address some of this, but I don't
expect to be able to finish it off for this release cycle, and I'd
really rather clean it up instead.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#41Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#40)
2 attachment(s)
Re: Logical decoding on standby

On 20 March 2017 at 17:03, Craig Ringer <craig@2ndquadrant.com> wrote:

> On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>
>> 2.1 Why does call to ReplicationSlotAcquire() move earlier in
>> pg_logical_slot_get_changes_guts()?
>
> That appears to be an oversight from an earlier version where it
> looped over timelines in pg_logical_slot_get_changes_guts. Reverted.
>
>> 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
>> documented well.
>> The setting
>> sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
>> should be
>> sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
>
> Definitely wrong. Fixed.

Attached.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Follow-timeline-switches-in-logical-decoding.patch (text/x-patch; charset=US-ASCII)
From fea2e80d1d1efe1c8ca8822357b8985828094877 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 1/2] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 198 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   3 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 ++++++++++++++
 7 files changed, 344 insertions(+), 19 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b2b9fcb..5a51be0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -662,6 +663,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -677,7 +679,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -704,6 +707,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -754,6 +758,129 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * It's necessary to care about timelines in xlogreader and logical decoding
+ * when we might be reading xlog generated prior to a promotion, either if
+ * we're currently a standby in recovery or if we're a promoted master reading
+ * xlogs generated by the old master before our promotion. Notably, logical
+ * decoding on a standby needs to be able to replay any remaining pending data
+ * from the old timeline when the standby or one of its upstreams is
+ * promoted.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends and we have to switch to a new one.
+ * Even in the middle of reading a page we could have to dump the cached page
+ * and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position if executing in recovery, so it doesn't fail to notice that the
+ * current timeline became historical.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -774,28 +901,71 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're doing logical decoding on C, and B gets promoted, our timeline
+		 * will change while we remain in recovery.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			read_upto = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might have to
+			 * wait for the desired record to be generated (or, for a standby,
+			 * received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				read_upto = GetFlushRecPtr();
+			}
+			else
+				read_upto = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * it.
+			 */
+			read_upto = state->currTLIValidUntil;
 
-		if (loc <= read_upto)
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 41c5000..16435c0 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -239,7 +239,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	ReplicationSlotAcquire(NameStr(*name));
 
@@ -280,6 +280,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0f6b828..90eb991 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -48,6 +48,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -721,6 +722,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -974,10 +981,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 663d3e7..a1beeb5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -161,6 +161,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 *
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 567a7f3..25a9942 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..142a1b8 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before base_backup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5

0002-Logical-decoding-on-standby.patch (text/x-patch; charset=US-ASCII)
From 34dea26dcd42b123242c326799f2b5ac7714ca95 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 2/2] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, make walsender
  exit with recovery conflict on upstream drop database when it has an active
  logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin, be called already locked.
* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in hot_standby_feedback
* Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
  requiring that xlog be emitted before vacuum can remove no-longer-needed catalog tuples.
  Store it in checkpoints and make vacuum and bgwriter advance it.
* During decoding startup, check whether the catalog_xmin requirement can be
  satisfied and bail out if it cannot.
* Add a new recovery conflict type for conflict with catalog_xmin. Abort
  in-progress logical decoding sessions with a recovery conflict when their
  needed catalog_xmin is too old.
* Make extra efforts to reserve master's catalog_xmin during decoding startup
  on standby.
* Try to make sure hot_standby_feedback is active when starting
  logical decoding.
* Remove checks preventing starting logical decoding on standby
---
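The hot_standby_feedback change listed above adds a catalog_xmin and its epoch after the existing xmin and epoch fields. The following is a rough sketch of how such a message body could be laid out, not the patch's walreceiver code. It assumes the message keeps the existing leading 'h' type byte and Int64 send timestamp from the documented streaming replication protocol; the HsFeedback struct and the hs_feedback_* function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one feedback message. */
typedef struct HsFeedback
{
	int64_t		send_time;			/* sender's clock, as in the existing message */
	uint32_t	xmin;				/* global xmin, excluding slot catalog_xmin */
	uint32_t	xmin_epoch;			/* epoch of xmin */
	uint32_t	catalog_xmin;		/* lowest catalog_xmin of any slot, or 0 */
	uint32_t	catalog_xmin_epoch;	/* epoch of catalog_xmin */
} HsFeedback;

/* 1 type byte + 8 byte timestamp + four 4-byte integers */
#define HS_FEEDBACK_MSG_LEN (1 + 8 + 4 * 4)

/* Store a 32-bit value in network (big-endian) byte order. */
static void
put_u32(unsigned char *p, uint32_t v)
{
	int			i;

	for (i = 0; i < 4; i++)
		p[i] = (unsigned char) ((v >> (24 - 8 * i)) & 0xFF);
}

/* Store a 64-bit value in network (big-endian) byte order. */
static void
put_u64(unsigned char *p, uint64_t v)
{
	put_u32(p, (uint32_t) (v >> 32));
	put_u32(p + 4, (uint32_t) (v & 0xFFFFFFFFu));
}

/*
 * Serialize the message into buf. Returns the number of bytes written,
 * or 0 if the buffer is too small.
 */
size_t
hs_feedback_pack(const HsFeedback *fb, unsigned char *buf, size_t buflen)
{
	unsigned char *p = buf;

	if (buflen < HS_FEEDBACK_MSG_LEN)
		return 0;

	*p++ = 'h';
	put_u64(p, (uint64_t) fb->send_time);
	p += 8;
	put_u32(p, fb->xmin);
	p += 4;
	put_u32(p, fb->xmin_epoch);
	p += 4;
	put_u32(p, fb->catalog_xmin);
	p += 4;
	put_u32(p, fb->catalog_xmin_epoch);
	p += 4;

	return (size_t) (p - buf);
}

/*
 * Per the protocol.sgml change in this patch, feedback is being switched
 * off only when both xmin and catalog_xmin are 0.
 */
int
hs_feedback_is_disable(const HsFeedback *fb)
{
	return fb->xmin == 0 && fb->catalog_xmin == 0;
}
```

Carrying the two values separately lets the upstream keep catalog tuples needed for decoding without also holding back cleanup of ordinary tables to the same horizon.
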
 contrib/pg_visibility/pg_visibility.c              |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 +-
 src/backend/access/heap/heapam.c                   |   2 +-
 src/backend/access/heap/rewriteheap.c              |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c             |   9 +
 src/backend/access/transam/varsup.c                |  14 +
 src/backend/access/transam/xact.c                  |  48 +++
 src/backend/access/transam/xlog.c                  |  16 +-
 src/backend/catalog/index.c                        |   2 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/dbcommands.c                  |   6 +
 src/backend/commands/vacuum.c                      |  13 +-
 src/backend/postmaster/bgwriter.c                  |  10 +
 src/backend/postmaster/pgstat.c                    |   3 +
 src/backend/replication/logical/decode.c           |  11 +
 src/backend/replication/logical/logical.c          | 318 ++++++++++++++-
 src/backend/replication/slot.c                     |  91 ++++-
 src/backend/replication/walreceiver.c              |  52 ++-
 src/backend/replication/walsender.c                | 137 +++++--
 src/backend/storage/ipc/procarray.c                | 190 +++++++--
 src/backend/storage/ipc/procsignal.c               |   3 +
 src/backend/storage/ipc/standby.c                  | 147 ++++++-
 src/backend/tcop/postgres.c                        |  38 +-
 src/bin/pg_controldata/pg_controldata.c            |   2 +
 src/include/access/transam.h                       |   5 +
 src/include/access/xact.h                          |  12 +-
 src/include/catalog/pg_control.h                   |   1 +
 src/include/pgstat.h                               |   3 +-
 src/include/replication/slot.h                     |   1 +
 src/include/replication/walreceiver.h              |   3 +
 src/include/storage/procarray.h                    |   9 +-
 src/include/storage/procsignal.h                   |   1 +
 src/include/storage/standby.h                      |   2 +
 src/test/recovery/t/006_logical_decoding.pl        |  29 +-
 .../recovery/t/010_logical_decoding_on_replica.pl  | 454 +++++++++++++++++++++
 36 files changed, 1544 insertions(+), 132 deletions(-)
 create mode 100644 src/test/recovery/t/010_logical_decoding_on_replica.pl

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d0f7618..6261e68 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -557,7 +557,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -674,7 +674,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 8db1e20..743cbee 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 244e381..d8786f0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1911,10 +1911,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1924,7 +1925,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8526137..07b8fa7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7328,7 +7328,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..36bbb98 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 42fc351..b6bee35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -393,6 +393,20 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 02e0779..3a5cb0c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5643,6 +5643,54 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+		/*
+		 * Notify any active logical decoding sessions to terminate if they
+		 * need the catalogs we're going to be allowed to remove after
+		 * replaying this record.
+		 */
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}
+
+	return ptr;
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9480377..3cbf42b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5005,6 +5005,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -5017,6 +5018,7 @@ BootStrapXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6607,6 +6609,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6623,6 +6628,7 @@ StartupXLOG(void)
 	ShmemVariableCache->oidCount = 0;
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8692,6 +8698,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -8895,7 +8902,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9258,7 +9265,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
@@ -9617,6 +9624,7 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9715,8 +9723,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8d42a34..7ce7c8f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2270,7 +2270,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b91df98..0f166a0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1000,7 +1000,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5a63b1a..052957b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2124,11 +2124,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ff633fa..46031a8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin();
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
@@ -527,7 +536,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -939,7 +948,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..df239e0 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,15 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly, even if we haven't had a
+		 * recent vacuum run.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin();
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a50488..b06b7eb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3320,6 +3320,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_LOGICAL_APPLY_MAIN:
 			event_name = "LogicalApplyMain";
 			break;
+		case WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE:
+			event_name = "StandbyLogicalSlotCreate";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..07a120d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..e5f812f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,10 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void WaitForMasterCatalogXminReservation(ReplicationSlot *slot);
+
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));
+
+		/*
+		 * When decoding on a standby we need a physical slot to be used by the
+		 * walreceiver so we can pin the upstream's catalog_xmin down even
+		 * over connection loss and restarts. This also gives us somewhere to
+		 * record our needed catalog xmin on the master.
+		 */
+		if (!walrcv_has_slot)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("no replication slot configured for connection to master"),
+					 errhint("Logical decoding on standby requires that a physical replication slot be used to connect the standby to the master.")));
+
+		/*
+		 * We need hot_standby_feedback to make sure the master doesn't vacuum
+		 * away tuples we need.
+		 *
+		 * This check doesn't stop the user disabling it once we check, but they
+		 * could also drop and re-create the physical replication slot without
+		 * our noticing or do other silly things. Don't do that. If they do it
+		 * anyway we'll notice and fail with conflict with recovery later.
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback is not enabled")));
+	}
 }
 
 /*
@@ -126,6 +164,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
 	 * xmin horizons by other backends, get the safe decoding xid, and inform
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -963,3 +1011,239 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.
+ *
+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.
+ *
+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally
+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply
+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */
+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{
+	TimestampTz waitStart;
+	char	   *new_status;
+	XLogRecPtr firstWaitWalEnd, lastWaitWalEnd;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(TransactionIdIsValid(slot->effective_catalog_xmin));
+	Assert(slot->effective_catalog_xmin == slot->data.catalog_xmin);
+
+	waitStart = GetCurrentTimestamp();
+	new_status = NULL;			/* we haven't changed the ps display */
+
+	/*
+	 * The master doesn't explicitly reply to hot standby feedback, doesn't
+	 * identify which feedback message is the most recent, and doesn't
+	 * report the catalog_xmin it has reserved.
+	 *
+	 * This leaves a potential race. If catalog_xmin is already pinned down by
+	 * some other slot on the master or another standby,
+	 * ShmemVariableCache->oldestCatalogXmin will be set by it. We don't know
+	 * if our hot standby feedback is in effect and pinning down catalog_xmin
+	 * yet. If we start at the current oldestCatalogXmin the other slot might
+	 * advance and allow vacuum to remove tuples we need before our hot standby
+	 * feedback can lock it in. This may result in a conflict with standby at
+	 * some point after we create the slot and start decoding, when we see the
+	 * new xl_xact_catalog_xmin_advance record, unless our own catalog_xmin has
+	 * advanced enough by then that we no longer need the removed catalogs.
+	 * That can only happen if the xact holding down catalog_xmin has committed
+	 * by the time the needed catalogs are removed, so we can decode it,
+	 * advance confirmed_flush_lsn, and advance restart_lsn + catalog_xmin.
+	 *
+	 * To reduce the chances of triggering this race we force immediate
+	 * hot_standby_feedback, wait for a new latestWalEnd report from the
+	 * sender, and wait until we replay past that before we take the
+	 * catalog_xmin to start from. Without the ability to ask the walsender
+	 * to verify receipt of, and successful reservation of, a specific hot
+	 * standby feedback message this is the best we can do.
+	 *
+	 * If we lose the race, decoding will fail with a recovery conflict later.
+	 * The client will have to drop the slot and try again.
+	 *
+	 * Users can further mitigate this risk with a sufficiently high
+	 * vacuum_defer_cleanup_age.
+	 *
+	 * Users can completely prevent this problem by creating a temporary
+	 * logical slot on the master and waiting for the replica to catch up to
+	 * the master's xlog insert position before they create a slot on the
+	 * replica. Then wait until a catalog_xmin is reported on the replica's
+	 * physical slot before dropping the temporary slot on the master.
+	 *
+	 * What we'd really like is to get a reply from the server explicitly
+	 * confirming that it has applied our hs_feedback and what the lowest
+	 * catalog_xmin it can honour is. This turns out to be tricky to do
+	 * through a cascade, so for now we'll tolerate slow slot creation
+	 * and a small race risk.
+	 */
+
+	firstWaitWalEnd = WalRcv->latestWalEnd;
+	lastWaitWalEnd = firstWaitWalEnd;
+
+	WalRcvForceReply();
+
+	while (lastWaitWalEnd == firstWaitWalEnd ||
+		   GetXLogReplayRecPtr(NULL) < lastWaitWalEnd ||
+		   !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+	{
+		int ret;
+
+		/*
+		 * We need to advance our slot's catalog_xmin to keep pace with the
+		 * latest reported position from the master. That way we won't get
+		 * canceled with a recovery conflict when the master sends catalog_xmin
+		 * updates while we're waiting for redo to catch up with the position
+		 * we saw when we started waiting.
+		 *
+		 * A problem arises here when the server sends an
+		 * xl_xact_catalog_xmin_advance with oldestCatalogXmin = 0, indicating
+		 * it is no longer reserving catalogs. Since we're creating a slot we
+		 * don't mind, but the redo code does not know that and will treat our
+		 * process as conflicting with recovery. To guard against that we'll
+		 * advance our oldestCatalogXmin to the new
+		 * GetOldestSafeDecodingTransactionId() and redo will ignore slots
+		 * whose catalog_xmin is >= nextXid. So long as we loop faster than the
+		 * maximum standby delay we'll keep ahead of recovery cancellations.
+		 * This means we must take XidGenLock once per loop, but it's not like
+		 * we spend a lot of time creating slots.
+		 *
+		 * It's fine for our catalog_xmin to go backwards when the server
+		 * reports it has nailed down catalog_xmin so we just unconditionally
+		 * reassign our catalog_xmin.
+		 */
+		slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+		ReplicationSlotsComputeRequiredXmin(true);
+
+		LWLockRelease(ProcArrayLock);
+
+		ret = WaitLatch(&MyProc->procLatch,
+						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						500, WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE);
+
+		if (ret & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		if (ret & WL_LATCH_SET)
+			ResetLatch(&MyProc->procLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Notice if the server has reported new WAL since we sent our feedback */
+		if (lastWaitWalEnd == firstWaitWalEnd)
+			lastWaitWalEnd = WalRcv->latestWalEnd;
+
+		/* Update process title if waiting long enough */
+		if (update_process_title && new_status == NULL &&
+			TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+									   500))
+		{
+			const char *old_status;
+			int			len;
+
+			old_status = get_ps_display(&len);
+			new_status = (char *) palloc(len + 8 + 1);
+			memcpy(new_status, old_status, len);
+			strcpy(new_status + len, " waiting");
+			set_ps_display(new_status, false);
+			new_status[len] = '\0'; /* truncate off " waiting" */
+		}
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	}
+
+	if (TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin))
+	{
+		/*
+		 * We didn't reserve the catalog_xmin we wanted; the master has already removed it.
+		 * We have to start decoding at a later point.
+		 */
+		slot->effective_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+		slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	}
+
+	ReplicationSlotsComputeRequiredXmin(true);
+
+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+
+	Assert(TransactionIdFollowsOrEquals(slot->effective_catalog_xmin, ShmemVariableCache->oldestCatalogXmin));
+	Assert(LWLockHeldByMe(ProcArrayLock));
+}
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past
+	 * the threshold we need, removing tuples that we'll require to start
+	 * decoding at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is needed
+		 * here since oldestCatalogXmin can only advance, so if it's past what
+		 * we need that's not going to change. We have marked our slot as active
+		 * so redo won't replay past our catalog_xmin without first terminating our
+		 * session.
+		 */
+		TransactionId shmem_catalog_xmin =
+			*(volatile TransactionId*)(&ShmemVariableCache->oldestCatalogXmin);
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							 NameStr(MyReplicationSlot->data.name))));
+	}
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5237a9f..5f51fa8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -796,6 +796,93 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.
+ *
+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
@@ -843,7 +930,9 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
+		 * (TODO: ask walreceiver to ask walsender to log it or ask bgworker to log it)
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 18d9d7e..6221dcf 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -508,9 +508,15 @@ WalReceiverMain(void)
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
+						 *
+						 * If logical decoding information is enabled, we also
+						 * send immediate hot standby feedback so as to reduce
+						 * the delay before our needed catalogs are locked in.
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
+						if (XLogLogicalInfoActive())
+							XLogWalRcvSendHSFeedback(true);
 						XLogWalRcvSendReply(true, false);
 					}
 				}
@@ -1175,8 +1181,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1227,57 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch--;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch--;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
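
Reader's note on the walreceiver.c change above: the hot standby feedback message grows from 17 to 25 bytes, because a catalog_xmin and its epoch are appended after the existing xmin and epoch. The following is a minimal standalone sketch of that layout, assuming (as pq_sendint and pq_sendint64 do) that integers go on the wire in network byte order. The helpers put_u32, put_u64 and pack_hs_feedback are hypothetical names used only for illustration; they are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 type byte + 8 byte timestamp + four 4-byte fields */
#define HS_FEEDBACK_LEN 25

/* Hypothetical helper: store a 32-bit value big-endian at buf[off]. */
static size_t
put_u32(unsigned char *buf, size_t off, uint32_t v)
{
	buf[off] = (unsigned char) (v >> 24);
	buf[off + 1] = (unsigned char) (v >> 16);
	buf[off + 2] = (unsigned char) (v >> 8);
	buf[off + 3] = (unsigned char) v;
	return off + 4;
}

/* Hypothetical helper: store a 64-bit value big-endian at buf[off]. */
static size_t
put_u64(unsigned char *buf, size_t off, uint64_t v)
{
	off = put_u32(buf, off, (uint32_t) (v >> 32));
	return put_u32(buf, off, (uint32_t) v);
}

/*
 * Pack the extended feedback message in the same field order as the
 * patched XLogWalRcvSendHSFeedback(). buf must hold HS_FEEDBACK_LEN bytes.
 * Returns the number of bytes written.
 */
static size_t
pack_hs_feedback(unsigned char *buf, uint64_t send_time,
				 uint32_t xmin, uint32_t xmin_epoch,
				 uint32_t catalog_xmin, uint32_t catalog_xmin_epoch)
{
	size_t		off = 0;

	buf[off++] = 'h';			/* message type */
	off = put_u64(buf, off, send_time);
	off = put_u32(buf, off, xmin);
	off = put_u32(buf, off, xmin_epoch);
	off = put_u32(buf, off, catalog_xmin);	/* new in this patch */
	off = put_u32(buf, off, catalog_xmin_epoch);	/* new in this patch */
	assert(off == HS_FEEDBACK_LEN);
	return off;
}
```

Since ProcessStandbyHSFeedbackMessage() now reads the two extra fields unconditionally, a sender and receiver built from different sides of this patch would presumably disagree on the message length.
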
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 90eb991..64f73af 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -192,7 +192,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -221,6 +220,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1543,6 +1543,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1605,7 +1610,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1626,6 +1631,22 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1636,59 +1657,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
- * Hot Standby feedback
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
  */
-static void
-ProcessStandbyHSFeedbackMessage(void)
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
 {
 	TransactionId nextXid;
 	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
+ * Hot Standby feedback
+ */
+static void
+ProcessStandbyHSFeedbackMessage(void)
+{
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
-
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1713,15 +1767,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
@@ -2588,17 +2650,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2632,7 +2683,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
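
Reader's note on TransactionIdInRecentPast() above: the check is easier to follow with concrete numbers. Below is a standalone sketch of the same logic, assuming both xids are normal (the patch only calls it after a TransactionIdIsNormal() test) and taking nextXid and its epoch as parameters instead of calling GetNextXidAndEpoch(). The names xid_precedes_or_equals and xid_in_recent_past are hypothetical stand-ins, and the local TransactionId typedef just mirrors PostgreSQL's 32-bit xid.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* local stand-in for PostgreSQL's type */

/*
 * Modulo-2^32 "a <= b" for normal xids, in the style of
 * TransactionIdPrecedesOrEquals(). Special xids are not handled here.
 */
static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) <= 0;
}

/*
 * Sketch of the sanity check: the reported (xid, epoch) pair must not be
 * in the future and must not be so old that it has wrapped around.
 */
static bool
xid_in_recent_past(TransactionId xid, uint32_t epoch,
				   TransactionId next_xid, uint32_t next_epoch)
{
	if (xid <= next_xid)
	{
		/* same side of the wraparound point: epochs must match */
		if (epoch != next_epoch)
			return false;
	}
	else
	{
		/* numerically larger xid: it must belong to the previous epoch */
		if (epoch + 1 != next_epoch)
			return false;
	}

	/* epoch is plausible, but the xid may still be too far back */
	if (!xid_precedes_or_equals(xid, next_xid))
		return false;

	return true;
}
```

With next_xid = 1000 in epoch 5, an xid of 900 in epoch 5 is accepted, an xid of 4000000000 in epoch 4 is accepted as a pre-wraparound value, and an xid of 2000 is rejected whichever epoch it claims.
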
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0f8f435..8b1fc60 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1292,17 +1292,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1376,9 +1381,13 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held. Note that
+	 * we're using the effective catalog_xmin for vacuum's tuple removal here,
+	 * as copied over by UpdateOldestCatalogXmin().
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (RecoveryInProgress())
 	{
@@ -1427,19 +1436,82 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}
+
+	return result;
+}
+
+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ */
+void
+UpdateOldestCatalogXmin(void)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	Assert(XLogInsertAllowed());
+
 	/*
-	 * After locks have been released and defer_cleanup_age has been applied,
-	 * check whether we need to back up further to make logical decoding
-	 * possible. We need to do so if we're computing the global limit (rel =
-	 * NULL) or if the passed relation is a catalog relation of some kind.
+	 * Do an unlocked check first. This is obviously race-prone especially
+	 * since replication_slot_catalog_xmin could be updated after we read
+	 * oldestCatalogXmin. But it doesn't matter if we get wrong results here,
+	 * it'll either cause us to take an unnecessary ProcArrayLock to recheck,
+	 * or delay an update until the next vacuum run.
+	 *
+	 * Note that we cannot skip this if !XLogLogicalInfoActive(), i.e. if
+	 * wal_level is < logical, because replication slots from a prior
+	 * startup with higher wal_level might still have a catalog_xmin set.
+	 * Testing oldestCatalogXmin and replication_slot_catalog_xmin is
+	 * relatively cheap, though.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+	slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
 
-	return result;
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -2167,14 +2239,20 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
+
+	/*
+	 * TODO: If we're on replica and using hot standby feedback to set catalog_xmin
+	 * we should be able to directly check the value reserved by feedback via shmem
+	 * from walreceiver, even if xlog replay hasn't passed that point yet.
+	 */
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2656,6 +2734,53 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
@@ -2964,18 +3089,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
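
Reader's note on CatalogXminNeedsUpdate() above: the predicate fires in three cases, namely when the slots' catalog_xmin has advanced past the value vacuum is using, when it becomes newly set, and when it becomes newly unset. Here is a standalone sketch under the assumption that TransactionIdPrecedes() compares normal xids modulo 2^32 and falls back to a plain unsigned comparison when either side is a special xid. The names xid_precedes and catalog_xmin_needs_update are hypothetical; the constants mirror InvalidTransactionId and FirstNormalTransactionId.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* local stand-in for PostgreSQL's type */

#define INVALID_XID			((TransactionId) 0)
#define FIRST_NORMAL_XID	((TransactionId) 3)

/* Sketch of TransactionIdPrecedes(): "a < b" with wraparound awareness. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	/* special xids compare as plain unsigned integers */
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * Sketch of the predicate: true if the value used by vacuum lags the
 * value needed by slots, or if exactly one of the two is unset.
 */
static bool
catalog_xmin_needs_update(TransactionId vacuum_catalog_xmin,
						  TransactionId slots_catalog_xmin)
{
	return xid_precedes(vacuum_catalog_xmin, slots_catalog_xmin) ||
		((vacuum_catalog_xmin != INVALID_XID) !=
		 (slots_catalog_xmin != INVALID_XID));
}
```

Note that the predicate is false when the slots' value is older than vacuum's value and both are set, which matches the intent that the effective value only ever advances or is cleared.
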
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..e255b23 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,7 +153,9 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
@@ -1110,3 +1113,145 @@ LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 					 nmsgs * sizeof(SharedInvalidationMessage));
 	XLogInsert(RM_STANDBY_ID, XLOG_INVALIDATIONS);
 }
+
+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		/* Reset standby wait back-off delay for each session waited for */
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but if we're an intermediate
+		 * cascading standby all we do is pass the catalog_xmin up to our
+		 * master and relay WAL down to the cascaded replica. Conflicts are the
+		 * cascaded replica's problem.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of in-use logical slots.
+		 * Inactive slots have the same effective and actual catalog_xmin, and
+		 * we'll detect conflicts with those when an attempt is made to use
+		 * them. Active slots' catalog_xmin can't go backwards unless they
+		 * become inactive.
+		 *
+		 * We specifically ignore catalog_xmin reservations >= nextXid here to allow
+		 * for slots still being created; see function comment.
+		 */
+		while (slot->in_use && slot->active_pid != 0 &&
+			   TransactionIdIsValid(slot->effective_catalog_xmin) &&
+			   (!TransactionIdIsValid(new_catalog_xmin) ||
+				TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)) &&
+			   TransactionIdPrecedes(slot->effective_catalog_xmin, ShmemVariableCache->nextXid))
+		{
+			/*
+			 * Wait for the conflicting session to exit, signalling it with
+			 * a conflict if necessary.
+			 *
+			 * We'll sleep here, so release the replication slot control lock. No
+			 * new conflicts can appear "behind" our scan of the replication_slots
+			 * array because sessions check the oldestCatalogXmin on decoding
+			 * startup. Releasing the lock lets the exiting backend clear the
+			 * slot's active_pid.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				/*
+				 * As a safeguard against signalling the wrong process in case of
+				 * pid reassignment, check that the slot's active_pid hasn't been
+				 * cleared or changed. Do an unlocked read here since the worst
+				 * wrong outcome even in the case of garbage read is an extra
+				 * sleep. If you get a new backend with the same pid in the
+				 * same slot array position you have terrible luck, and it
+				 * might get cancelled with a spurious conflict.
+				 */
+				if (active_pid != slot->active_pid)
+					continue;
+
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("Pid %u requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/*
+				 * Wait a little bit for it to die so that we avoid flooding
+				 * an unresponsive backend when system is heavily loaded.
+				 */
+				pg_usleep(5000L);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b07d6c6..8f69cfe 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2270,6 +2270,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2692,8 +2695,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2775,6 +2782,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2849,11 +2858,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 522c104..441edbe 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+
+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */
+
 } VariableCacheData;
 
 typedef VariableCacheData *VariableCache;
@@ -173,6 +177,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..d40bd4c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -118,7 +118,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -167,6 +167,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+} xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -370,6 +377,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..ef33014 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin; /* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f2daf32..225f509 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -748,7 +748,8 @@ typedef enum
 	WAIT_EVENT_WAL_SENDER_MAIN,
 	WAIT_EVENT_WAL_WRITER_MAIN,
 	WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-	WAIT_EVENT_LOGICAL_APPLY_MAIN
+	WAIT_EVENT_LOGICAL_APPLY_MAIN,
+	WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;
 
 /* ----------
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 62cacdb..9a2dbd7 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 78e577c..74ae4bf 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -116,6 +116,9 @@ typedef struct
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.
+	 *
+	 * If hot standby feedback is enabled, a hot standby feedback message
+	 * will also be sent.
 	 */
 	bool		force_reply;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9d5a13e..c746464 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
@@ -79,6 +79,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
@@ -87,6 +89,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(void);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..74713f9 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,6 +34,8 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 3f249cd..2919cc9 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 5;
+use Test::More tests => 20;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -17,6 +17,10 @@ $node_master->append_conf(
 wal_level = logical
 ));
 $node_master->start;
+
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after start");
+
 my $backup_name = 'master_backup';
 
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -54,7 +58,7 @@ my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logi
 is($stdout_sql, $expected, 'got expected output from SQL decoding session');
 
 my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
-diag "waiting to replay $endpos";
+note "waiting to replay $endpos";
 
 my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
 chomp($stdout_recv);
@@ -64,5 +68,26 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+# Restarting a node with wal_level lowered below logical while it still has
+# logical slots must succeed, but decoding from those slots must fail.
+$node_master->safe_psql('postgres', 'ALTER SYSTEM SET wal_level = replica');
+is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'logical', 'wal_level is still logical before restart');
+$node_master->restart;
+is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'replica', 'wal_level is replica');
+isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
+	'restored slot catalog_xmin is nonzero');
+is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
+	'reading from slot with wal_level < logical fails');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+	"pg_controldata's oldestCatalogXmin is nonzero");
+is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
+	'can drop logical slot while wal_level = replica');
+is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint");
+
+
 # done with the node
 $node_master->stop;
diff --git a/src/test/recovery/t/010_logical_decoding_on_replica.pl b/src/test/recovery/t/010_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..ce8a6af
--- /dev/null
+++ b/src/test/recovery/t/010_logical_decoding_on_replica.pl
@@ -0,0 +1,454 @@
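+# Test logical decoding on a physical standby: slot creation on the replica,
+# catalog_xmin feedback to the master via hot standby feedback, and recovery
+# conflicts when the master removes catalog rows a slot still needs.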
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+hot_standby_feedback = on
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# without the catalog_xmin hot standby feedback patch, catalog_xmin is always null
+# and xmin is the min(xmin, catalog_xmin) of all slots on the standby + anything else
+# holding down xmin.
+ok(!$xmin, "xmin null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+ok(!$catalog_xmin, "catalog_xmin null");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay from slot succeeded');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+is($stderr, '', 'stderr is empty');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# an initial decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+# If we change the catalogs, we'll get a conflict with recovery, but only
+# if there's an active xact when decoding. Logical decoding
+# doesn't keep a virtualxid while waiting for WAL, only when calling output
+# plugins, so this won't work damn.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+sleep(2);
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($catalog_xmin, '', "physical catalog_xmin null");
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#42Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#37)
Re: Logical decoding on standby

Hi,

Have you checked how high the overhead of XLogReadDetermineTimeline is?
A non-local function call, especially into a different translation-unit
(no partial inlining), for every single page might end up being
noticeable. That's fine in the cases it actually adds functionality,
but for a master streaming out data, that's not actually adding
anything.
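
One common way to keep a per-page cost like this off the hot path is a
cheap inline guard in front of the out-of-line call. The following is only
a stand-alone sketch of that idea, not code from the patch: ReaderState,
may_switch_timeline, determine_timeline() and determine_timeline_slow() are
hypothetical names, and the slow path here just counts calls so the effect
is visible.

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical reader state; the real XLogReaderState looks different. */
typedef struct ReaderState
{
    bool        may_switch_timeline;    /* false on a master streaming its own WAL */
    uint32_t    current_tli;
    int         slow_path_calls;        /* instrumentation for this sketch only */
} ReaderState;

/* Out-of-line slow path: stands in for the timeline-history lookup. */
static void
determine_timeline_slow(ReaderState *state, uint64_t page_ptr)
{
    (void) page_ptr;
    assert(state->may_switch_timeline);
    state->slow_path_calls++;
}

/* Inline guard evaluated per page; skips the call when nothing can change. */
static inline void
determine_timeline(ReaderState *state, uint64_t page_ptr)
{
    if (!state->may_switch_timeline)
        return;
    determine_timeline_slow(state, page_ptr);
}
```

Whether a flag like this can be maintained cheaply and correctly in the
real reader is an open question; the sketch only shows where the branch
would sit.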

Did you check whether your changes to read_local_xlog_page could cause
issues with twophase.c? Because that now also uses it.

Did you check whether ThisTimeLineID is actually always valid in the
processes logical decoding could run in? IIRC it's not consistently
updated during recovery in any process but the startup process.

On 2017-03-19 21:12:23 +0800, Craig Ringer wrote:

From 2fa891a555ea4fb200d75b8c906c6b932699b463 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH 2/3] Follow timeline switches in logical decoding

FWIW, the title doesn't really seem accurate to me.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Every time I read references to anything like this my blood starts to
boil. I kind of regret not having plastered RecoveryInProgress() errors
all over this code.

From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 3/3] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, make walsender
exit with recovery conflict on upstream drop database when it has an active
logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin, be called already locked.

"be called already locked"?

* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in hot_standby_feedback

What does separate mean?

* Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store
it in checkpoints, make vacuum and bgwriter advance it.

I can't parse that sentence.

* Add a new recovery conflict type for conflict with catalog_xmin. Abort
in-progress logical decoding sessions with conflict with recovery where needed
catalog_xmin is too old

Are we retaining WAL for slots broken in that way?

* Make extra efforts to reserve master's catalog_xmin during decoding startup
on standby.

What does that mean?

* Remove checks preventing starting logical decoding on standby

To me that's too many different things in one commit. A bunch of them
seem like it'd be good if they'd get independent buildfarm cycles too.

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..36bbb98 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
if (!state->rs_logical_rewrite)
return;
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);

What does that comment mean? Vacuum isn't the only thing that prunes old
records.

+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));

Uh, that's long-ish. And doesn't agree with the comment above
(s/startup process/process performing recovery/?).

This is a long enough list that I'd consider just dropping the assert.

+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Unless logical decoding is possible on this node, we don't care about
+		 * this record.
+		 */
+		if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+			return;

Too many negatives for my taste, but whatever.

+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);

Which seems to rely on ResolveRecoveryConflictWithLogicalDecoding's
lwlock acquisition for barriers?

+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	XLogRecPtr ptr = InvalidXLogRecPtr;
+
+	if (XLogInsertAllowed())
+	{
+		xl_xact_catalog_xmin_advance xlrec;
+
+		xlrec.new_catalog_xmin = new_catalog_xmin;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+		ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+	}

Huh, why is this test needed and ok?

@@ -9449,6 +9456,16 @@ XLogReportParameters(void)
XLogFlush(recptr);
}

+		/*
+		 * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+		 * require oldestCatalogXmin in checkpoints and it no longer
+		 * makes sense, so update shmem and xlog the change. This will
+		 * get written out in the next checkpoint.
+		 */
+		if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+			wal_level < WAL_LEVEL_LOGICAL)
+			UpdateOldestCatalogXmin(true);

What if we crash before this happens?

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ff633fa..2d16bf0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId safeMxactLimit;
/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin(false);

I'm on a first read-through of this, but it appears you don't do
anything similar in heap_page_prune()? And we can't just start emitting
loads of additional records there, because it's called much more often...

/*
* Make sure the current settings & environment are capable of doing logical
* decoding.
@@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));

-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		bool walrcv_running, walrcv_has_slot;
+
+		SpinLockAcquire(&WalRcv->mutex);
+		walrcv_running = WalRcv->pid != 0;
+		walrcv_has_slot = WalRcv->slotname[0] != '\0';
+		SpinLockRelease(&WalRcv->mutex);
+
+		/*
+		 * The walreceiver should be running when we try to create a slot. If
+		 * we're unlucky enough to catch the walreceiver just as it's
+		 * restarting after an error, well, the client can just retry. We don't
+		 * bother to sleep and re-check.
+		 */
+		if (!walrcv_running)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("streaming replication is not active"),
+					 errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));

That seems quite problematic. What if there's a momentary connection
failure? This also has the issue that just because you checked
walrcv_running at some point, that doesn't guarantee anything by the time
you actually rely on it. Seems like life would be easier if recovery.conf
were guc-ified already - checking for primary_conninfo/primary_slot_name etc
wouldn't have that issue (and can't be changed while running).

Usage of a slot doesn't actually guarantee much in cascaded setups, does
it?

@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
* xmin horizons by other backends, get the safe decoding xid, and inform
* the slot machinery about the new limit. Once that's done the
* ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * protecting against vacuum - if we're on the master. If we're running on
+	 * a replica, we have to wait until hot_standby_feedback locks in our
+	 * needed catalogs, per details on WaitForMasterCatalogXminReservation().
* ----
*/
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,

ReplicationSlotsComputeRequiredXmin(true);

+	if (RecoveryInProgress())
+		WaitForMasterCatalogXminReservation(slot);
+
+	Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+										 slot->data.catalog_xmin));
+
LWLockRelease(ProcArrayLock);

I think it's quite a bad idea to do a blocking operation like
WaitForMasterCatalogXminReservation while holding ProcArrayLock.

+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.

Ah. I see. Hm :(.

+ * We're pretty much just hoping that, if someone else already has a
+ * catalog_xmin reservation affecting the master, it stays where we want it
+ * until our own hot_standby_feedback can pin it down.

Hm.

+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally

Will that already trigger recovery conflicts?

+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply

"or some other handy bit"?

+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */

I was about to list some of these issues. That's a bit unsatisfying.

Pondering this for a bit, but I'm ~9h into a flight, so maybe not
tonight^Wthis morning^Wwhaaaa.

+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{

This comment seems to duplicate some of the function header
comment. Such duplication usually leads to either or both getting out of
date rather quickly.

Not commenting line-by-line on the code here, but I'm extremely doubtful
that this approach is stable enough, and that the effect of holding
ProcArrayLock exclusively over prolonged amounts of time is acceptable.

+ ReplicationSlotsComputeRequiredXmin(true);

Why do we need this? The caller does it too, no?

+	/* Tell the master what catalog_xmin we settled on */
+	WalRcvForceReply();
+
+	/* Reset ps display if we changed it */
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}

We really shouldn't do stuff like this while holding ProcArrayLock.

+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid()
+{

Missing (void).

+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.

Stuff like this really should be its own commit. It can trivially be
tested on its own, is useful on its own (just have DROP DATABASE do it),
...

+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.

We worked quite hard to make it extremely unlikely for that to happen in
practice. I also don't see why there should be any new PANICs in this
code.

+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.

That seems fine.

+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+	/*
+	 * We only need a shared lock here even though we activate slots,
+	 * because we have an exclusive lock on the database we're dropping
+	 * slots on and don't touch other databases' slots.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

Hm? Acquiring a slot always only takes a shared lock, no?

I don't really see how "database is locked" guarantees enough for your
logic - it's already possible to drop slots from other databases, and
dropping a slot acquires it temporarily?

+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotControlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * The caller should have an exclusive lock on the database so
+		 * we'll never have any in-use slots, but just in case...
+		 */
+		if (active_pid)
+			elog(PANIC, "replication slot %s is in use by pid %d",
+				 NameStr(slotname), active_pid);

So, yea, this doesn't seem ok. Why don't we just ERROR out, instead of
PANICing? There seems to be absolutely no correctness reason for a PANIC
here?

+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * There's no race here: we acquired this slot, and no slot "behind"
+		 * our scan can be created or become active with our target dboid due
+		 * to our exclusive lock on the DB.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

I don't see much problem with this, but I'd change the code so you
simply do a goto restart; if you released the slot. Then there's a lot
less chance / complications around temporarily releasing
ReplicationSlotControlLock.
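
A minimal sketch of the restart approach suggested above, written as
stand-alone C rather than against the backend. Everything here is a
stand-in: Slot, slots, lock_acquire()/lock_release() and drop_slot() are
hypothetical names that only mimic the roles of the slot array,
ReplicationSlotControlLock and ReplicationSlotDropAcquired(). The point is
the control flow: once the lock has been released to drop a slot, the scan
starts over instead of resuming.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLOTS 8

/* Hypothetical stand-in for a replication slot; not the real struct. */
typedef struct Slot
{
    bool        in_use;
    bool        is_logical;
    unsigned    database;       /* stand-in for the database Oid */
} Slot;

static Slot slots[MAX_SLOTS];
static int  lock_depth = 0;     /* stand-in for ReplicationSlotControlLock */

static void lock_acquire(void) { lock_depth++; }
static void lock_release(void) { lock_depth--; }

/* Stand-in for ReplicationSlotDropAcquired(); must run without the lock. */
static void
drop_slot(Slot *s)
{
    assert(lock_depth == 0);
    s->in_use = false;
}

/* Drop every logical slot on dboid; returns the number dropped. */
static int
drop_db_slots(unsigned dboid)
{
    int         dropped = 0;
    int         i;

restart:
    lock_acquire();
    for (i = 0; i < MAX_SLOTS; i++)
    {
        Slot       *s = &slots[i];

        if (!s->in_use || !s->is_logical || s->database != dboid)
            continue;

        /*
         * Release the lock, drop the slot, then rescan from the top
         * rather than continuing a scan made under a lock we gave up.
         */
        lock_release();
        drop_slot(s);
        dropped++;
        goto restart;
    }
    lock_release();

    return dropped;
}
```

The rescan makes the loop quadratic in the worst case, which seems
acceptable given that dropping a database with many slots is rare.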

+						 *
+						 * If logical decoding information is enabled, we also
+						 * send immediate hot standby feedback so as to reduce
+						 * the delay before our needed catalogs are locked in.

"logical decoding information ... enabled" and "catalogs are locked in"
are a bit too imprecise descriptions for my taste.

@@ -1175,8 +1181,8 @@ XLogWalRcvSendHSFeedback(bool immed)
{
TimestampTz now;
TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
static TimestampTz sendTime = 0;
/* initially true so we always send at least one feedback message */
static bool master_has_standby_xmin = true;
@@ -1221,29 +1227,57 @@ XLogWalRcvSendHSFeedback(bool immed)
* everything else has been checked.
*/
if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */

I again don't think it's good to refer to vacuum as it's not the only
thing that can remove tuple versions.

+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * The catalog_xmin reported by GetOldestXmin is the effective
+		 * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+		 * records from the master. Sending it back to the master would be
+		 * circular and prevent its catalog_xmin ever advancing once set.
+		 * We should only send the catalog_xmin we actually need for slots.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);

Given that you don't have catalog_xmin set by GetOldestXmin that comment
is a bit misleading.

@@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
NormalTransactionIdPrecedes(replication_slot_xmin, result))
result = replication_slot_xmin;

+	if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+	{
+		/*
+		 * After locks have been released and defer_cleanup_age has been applied,
+		 * check whether we need to back up further to make logical decoding
+		 * safe. We need to do so if we're computing the global limit (rel =
+		 * NULL) or if the passed relation is a catalog relation of some kind.
+		 */
+		if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+			NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+			result = replication_slot_catalog_xmin;
+	}

The nesting of these checks, and the comments about them, is a bit
weird.

+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}

Your lines are really long - pgindent (which you really should run) will
mangle this. I think it'd be better to rephrase this.
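
One possible rephrasing that keeps the lines short and separates the two
conditions, as a sketch. The typedef and macros below are simplified copies
so the fragment compiles on its own; in the tree they would come from
access/transam.h. The result is meant to be identical to the version in
the patch for every input.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the definitions in access/transam.h. */
typedef uint32_t TransactionId;

#define InvalidTransactionId        ((TransactionId) 0)
#define FirstNormalTransactionId    ((TransactionId) 3)
#define TransactionIdIsValid(xid)   ((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid)  ((xid) >= FirstNormalTransactionId)

/* Same rule as the backend: modulo-2^32 comparison for normal XIDs. */
static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    int32_t     diff;

    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return id1 < id2;

    diff = (int32_t) (id1 - id2);
    return diff < 0;
}

static bool
CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin,
                       TransactionId slots_catalog_xmin)
{
    bool        vacuum_valid = TransactionIdIsValid(vacuum_catalog_xmin);
    bool        slots_valid = TransactionIdIsValid(slots_catalog_xmin);

    /* The value became newly set or newly unset. */
    if (vacuum_valid != slots_valid)
        return true;

    /* The slots' catalog_xmin moved ahead of the effective one. */
    return TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin);
}
```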

+/*
+ * If necessary, copy the current catalog_xmin needed by repliation slots to

Typo: repliation

+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)

I'm a bit confused by this function and variable name. What does

+	TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+									  * is guaranteed to still exist */

mean? I complained about the overall justification in the commit
already, but looking at this commit alone, the justification for this
part of the change is quite hard to understand.

+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	/*
+	 * If we're not recording logical decoding information, catalog_xmin
+	 * must be unset and we don't need to do any work here.

If we don't need to do any work, shouldn't we return early?

+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		/*
+		 * A concurrent updater could've changed these values so we need to re-check
+		 * under ProcArrayLock before updating.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+		slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);

Why are there volatile reads here?

+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);

Why don't we check force here, but above?

@@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
oldestSafeXid = ShmemVariableCache->nextXid;

/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by vacuum
+	 * it's definitely safe to start there, and it can't advance
+	 * while we hold ProcArrayLock.

What does "held down by vacuum" mean?

/*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan later.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+
+			/*
+			 * Kill the pid if it's still here. If not, that's what we
+			 * wanted so ignore any errors.
+			 */
+			(void) SendProcSignal(session_pid,
+				PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);

Doesn't seem ok to do this while holding ProcArrayLock.

+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin threshold and signal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */

The use of "we" seems confusing here, because it's not the same process.

Generally I think your comments need to be edited a bit for brevity and
precision.
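
For readers following along, the rule the quoted comment describes can be condensed into a small predicate. This is a minimal standalone sketch, not code from the patch: slot_conflicts_sketch and xid_precedes are hypothetical names, the XID comparison is a simplified version of PostgreSQL's circular comparison, and the handling of an invalid (zero) limit is one reading of the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Simplified circular XID comparison; both arguments must be valid XIDs. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    assert(a != InvalidTransactionId && b != InvalidTransactionId);
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical predicate: does a slot with the given catalog_xmin conflict
 * with a newly replayed catalog_xmin limit from the master?
 */
static bool
slot_conflicts_sketch(TransactionId slot_catalog_xmin,
                      TransactionId new_catalog_xmin,
                      TransactionId next_xid)
{
    /* A slot that reserves no catalogs cannot conflict. */
    if (slot_catalog_xmin == InvalidTransactionId)
        return false;

    /* catalog_xmin >= nextXid: the slot is still seeking a start point. */
    if (!xid_precedes(slot_catalog_xmin, next_xid))
        return false;

    /* Master reports nothing reserved: an established reservation conflicts. */
    if (new_catalog_xmin == InvalidTransactionId)
        return true;

    /* Otherwise the slot conflicts only if it needs catalogs below the limit. */
    return xid_precedes(slot_catalog_xmin, new_catalog_xmin);
}
```

The second check is the special case the comment spends most of its length on: a slot whose tentative reservation is at or beyond nextXid is skipped rather than cancelled.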

+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and wait for it to be free,
+	 * signalling it if necessary, then repeat until there are no more
+	 * conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{

I'm pretty strongly against any code outside of slot.c doing this.

@@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));

/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
*/
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
RecoveryConflictRetryable = false;
}

Hm. Why is this a non-retryable error?

Ok, landing soon. Gotta finish here.

0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends. I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43Craig Ringer
craig@2ndquadrant.com
In reply to: Petr Jelinek (#38)
Re: Logical decoding on standby

On 19 March 2017 at 22:12, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

I am slightly worried about impact of the readTimeLineHistory() call but
I think it should be called so little that it should not matter.

Pretty much my thinking too.

That brings us to the big patch 0003.

I still don't like the "New in 10.0" comments in documentation, for one
it's 10, not 10.0 and mainly we don't generally write stuff like this to
documentation, that's what release notes are for.

OK. Personally I think it's worthwhile for protocol docs, which are
more dev-focused. But I agree it's not consistent with the rest of the
docs, so removed.

(Frankly I wish we did this consistently throughout the Pg docs, too,
and it'd be much more user-friendly if we did, but that's just not
going to happen.)

There are large amounts of whitespace mess (empty lines with only
whitespace, spaces at the end of the lines), nothing horrible, but
should be cleaned up.

Fixed.

One thing I don't understand much is the wal_level change and turning
off catalogXmin tracking. I don't really see anywhere that the
catalogXmin would be reset in control file for example. There is TODO in
UpdateOldestCatalogXmin() that seems related but tbh I don't follow
what's happening there - comment says about not doing anything, but the
code inside the if block are only Asserts.

UpdateOldestCatalogXmin(...) with force=true forces a
XactLogCatalogXminUpdate(...) call to write the new
procArray->replication_slot_catalog_xmin .

We call it with force=true from XLogReportParameters(...) when
wal_level has been lowered; see XLogReportParameters. This will write
out a xl_xact_catalog_xmin_advance with
procArray->replication_slot_catalog_xmin's value then update
ShmemVariableCache->oldestCatalogXmin in shmem.
ShmemVariableCache->oldestCatalogXmin will get written out in the next
checkpoint, which gets incorporated in the control file.

There is a problem though - StartupReplicationSlots() and
RestoreSlotFromDisk() don't care if catalog_xmin is set on a slot but
wal_level is < logical and will happily restore a logical slot, or a
physical slot with a catalog_xmin. So we can't actually assume that
procArray->replication_slot_catalog_xmin will be 0 if we're not
writing new logical WAL. This isn't a big deal, it just means we can't
short-circuit UpdateOldestCatalogXmin() calls if
!XLogLogicalInfoActive(). It also means the XLogReportParameters()
stuff can be removed since we don't care about wal_level for tracking
oldestCatalogXmin.

Fixed in updated patch.

I'm now reading over Andres's review.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#44Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#42)
Re: Logical decoding on standby

On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

Hi,

Have you checked how high the overhead of XLogReadDetermineTimeline is?
A non-local function call, especially into a different translation-unit
(no partial inlining), for every single page might end up being
noticeable. That's fine in the cases it actually adds functionality,
but for a master streaming out data, that's not actually adding
anything.

I don't anticipate any significant effect given the large amount of
indirection via decoding, reorder buffer, snapshot builder, output
plugin, etc that we already do and how much memory allocation gets
done ... but it's worth checking. I could always move the fast path
into a macro or inline function if it does turn out to make a
detectable difference.

One of the things I want to get to is refactoring all the xlog page
reading stuff into a single place, shared between walsender and normal
backends, to get rid of this confusing mess we currently have. The
only necessary difference is how we wait for new WAL, the rest should
be as common as possible allowing for xlogreader's needs. I
particularly want to get rid of the two identically named static
XLogRead functions. But all that probably means making timeline.c
FRONTEND friendly and it's way too intrusive to contemplate at this
stage.

Did you check whether your changes to read_local_xlog_page could cause
issues with twophase.c? Because that now also uses it.

Thanks, that's a helpful point. The commit in question is 978b2f65. I
didn't notice that it introduced XLogReader use in twophase.c, though
I should've realised given the discussion about fetching recent 2pc
info from xlog. I don't see any potential for issues at first glance,
but I'll go over it in more detail. The main concern is validity of
ThisTimeLineID, but since it doesn't run in recovery I don't see much
of a problem there. That also means it can afford to use the current
timeline-oblivious read_local_xlog_page safely.

TAP tests for 2pc were added by 3082098. I'll check to make sure they
have appropriate coverage for this.

Did you check whether ThisTimeLineID is actually always valid in the
processes logical decoding could run in? IIRC it's not consistently
updated during recovery in any process but the startup process.

I share your concerns that it may not be well enough maintained.
Thank you for the reminder; that's been on my TODO and got lost when I
had to task-hop to other priorities.

I have some TAP tests to validate promotion that need finishing off.
My main concern is around live promotions, both promotion of standby
to master, and promotion of upstream master when streaming from a
cascaded replica.

[Will cover review of 0003 separately, next]

0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends. I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

That would be my preference too.

I do not actually feel strongly about the need for logical decoding on
standby, and would in many ways prefer to defer it until we have
two-way hot standby feedback and the ability to have the master
confirm the actual catalog_xmin locked in to eliminate the current
race and ugly workaround for it. I'd rather have solid timeline
following in place now and bare-minimum failover capability.

I'm confident that the ability for xlogreader to follow timeline
switches will also be independently useful.

The parts I think are important for Pg10 are:

* Teach xlogreader to follow timeline switches
* Ability to create logical slots on replicas
* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all.
* Ability to drop logical slots on replicas

That would be enough to provide minimal standby promotion without hideous hacks.

Unfortunately, slot creation on standby is probably the ugliest part
of the patch. It can be considerably simplified by imposing the rule
that the application must ensure catalog_xmin on the master is already
held down (with a replication slot) before creating a slot on the
standby, and it's the application's job to send feedback to the master
before any standbys it's keeping slots on. If the app fails to do so,
the slot on the downstream will become unusable and attempts to decode
changes from it will fail with conflict with recovery.

That'd get rid of a lot of the code including some of the ugliest
bits, since we'd no longer make any special effort with catalog_xmin
during slot creation. We're already pushing complexity onto apps for
this, after concluding that the transparent failover slots approach
wasn't the way forward, so I'm OK with that. Let the apps that want
logical decoding to support physical replica promotion pay most of the
cost.

I'd then like to revisit full decoding on standby later, once we have
2-way hot standby feedback, where the upstream can reply with
confirmation xmin is locked in, including cascading handling.

Getting there would mostly involve trimming this patch down, which is
nice. It would be necessary to add a SQL function and/or walsender
command to send feedback on a slot we're not currently replaying
changes from, but I see that as independently valuable and have wanted
it for a number of things already. We'd still have to decode (so we
find the right restart_lsn), but we'd suppress output plugin calls
entirely.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#45Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#42)
Re: Logical decoding on standby

On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

Subject: [PATCH 2/3] Follow timeline switches in logical decoding

FWIW, the title doesn't really seem accurate to me.

Yeah, it's not really at the logical decoding layer at all.

"Teach xlogreader to follow timeline switches" ?

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

Every time I read references to anything like this my blood starts to
boil. I kind of regret not having plastered RecoveryInProgress() errors
all over this code.

In fairness, I've been trying for multiple releases to get a "right"
way in. I have no intention of using such hacks, and only ever did so
for testing xlogreader timeline following without full logical
decoding on standby being available.

From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 5 Sep 2016 15:30:53 +0800
Subject: [PATCH 3/3] Logical decoding on standby

* Make walsender aware of ProcSignal and recovery conflicts, make walsender
exit with recovery conflict on upstream drop database when it has an active
logical slot on that database.
* Allow GetOldestXmin to omit catalog_xmin, be called already locked.

"be called already locked"?

To be called with ProcArrayLock already held. But that's actually
outdated due to changes Petr requested earlier, thanks for noticing.

* Send catalog_xmin separately in hot_standby_feedback messages.
* Store catalog_xmin separately on a physical slot if received in hot_standby_feedback

What does separate mean?

Currently, hot standby feedback sends effectively the
min(catalog_xmin, xmin) to the upstream, which in turn records that
either in the PGPROC entry or, if there's a slot in use, in the xmin
field on the slot.

So catalog_xmin on the standby gets promoted to xmin on the master's
physical slot. Lots of unnecessary bloat results.

This splits it up, so we can send catalog_xmin separately on the wire,
and store it on the physical replication slot as catalog_xmin, not
xmin.
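
To illustrate the difference, here is a minimal standalone sketch (not code from the patch). FeedbackSketch, feedback_combined and feedback_separate are hypothetical names, and the XID comparison is a simplified version of PostgreSQL's circular comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Simplified circular XID comparison; both arguments must be valid XIDs. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    assert(a != InvalidTransactionId && b != InvalidTransactionId);
    return (int32_t) (a - b) < 0;
}

/* Return the older of two XIDs, treating an invalid XID as "not set". */
static TransactionId
xid_older(TransactionId a, TransactionId b)
{
    if (a == InvalidTransactionId)
        return b;
    if (b == InvalidTransactionId)
        return a;
    return xid_precedes(a, b) ? a : b;
}

/* Hypothetical stand-in for the horizons carried by a feedback message. */
typedef struct FeedbackSketch
{
    TransactionId xmin;
    TransactionId catalog_xmin;
} FeedbackSketch;

/* Current behaviour: one horizon, so catalog_xmin drags xmin back. */
static FeedbackSketch
feedback_combined(TransactionId xmin, TransactionId catalog_xmin)
{
    FeedbackSketch fb = {xid_older(xmin, catalog_xmin), InvalidTransactionId};

    return fb;
}

/* Proposed behaviour: both horizons travel and are stored separately. */
static FeedbackSketch
feedback_separate(TransactionId xmin, TransactionId catalog_xmin)
{
    FeedbackSketch fb = {xmin, catalog_xmin};

    return fb;
}
```

With xmin = 1000 and catalog_xmin = 500, the combined form holds back every table on the master to 500, while the separate form holds user tables to 1000 and only the catalogs to 500.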

* Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store
it in checkpoints, make vacuum and bgwriter advance it.

I can't parse that sentence.

We now write an xlog record before allowing the catalog_xmin in
ProcArray replication_slot_catalog_xmin to advance and allow catalog
tuples to be removed. This is achieved by making vacuum use a separate
field in ShmemVariableCache, oldestCatalogXmin. When vacuum looks up
the new xmin from GetOldestXmin, it copies
ProcArray.replication_slot_catalog_xmin to
ShmemVariableCache.oldestCatalogXmin, writing an xlog record to ensure
we remember the new value and ensure standbys know about it.

This provides a guarantee to standbys that all catalog tuples >=
ShmemVariableCache.oldestCatalogXmin are protected from vacuum and
lets them discover when that threshold advances.
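
The ordering is the important part: log first, then let the effective limit move. Here is a toy model of that, assuming a tiny in-memory array in place of real WAL. MasterSketch and both function names are hypothetical, and the inequality test stands in for the patch's CatalogXminNeedsUpdate().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define WAL_CAPACITY 16

typedef struct MasterSketch
{
    TransactionId slots_catalog_xmin;   /* stands in for procArray->replication_slot_catalog_xmin */
    TransactionId oldest_catalog_xmin;  /* stands in for ShmemVariableCache->oldestCatalogXmin */
    TransactionId wal[WAL_CAPACITY];    /* toy WAL: one entry per advance record */
    int           wal_len;
} MasterSketch;

/*
 * Advance the limit used for tuple removal, but only after recording the
 * new value in the toy WAL. Returns true if an advance was logged.
 */
static bool
update_oldest_catalog_xmin_sketch(MasterSketch *m)
{
    if (m->oldest_catalog_xmin == m->slots_catalog_xmin)
        return false;           /* nothing changed, nothing to log */

    assert(m->wal_len < WAL_CAPACITY);
    m->wal[m->wal_len++] = m->slots_catalog_xmin;       /* 1. log it */
    m->oldest_catalog_xmin = m->slots_catalog_xmin;     /* 2. apply it */
    return true;
}

/* A standby learns the limit purely by replaying the logged records. */
static TransactionId
standby_replay_sketch(const MasterSketch *m, TransactionId standby_oldest)
{
    for (int i = 0; i < m->wal_len; i++)
        standby_oldest = m->wal[i];
    return standby_oldest;
}
```

Because removal is gated on oldest_catalog_xmin rather than on the slots' value, a standby replaying the log always sees the advance record before any record that removes catalog tuples under the new limit.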

The reason we cannot use the xid field in existing vacuum xlog records
is that the downstream has no way to know if the xact affected
catalogs and therefore whether it should advance its idea of
catalog_xmin or not. It can't get a Relation for the affected
relfilenode because it can't use the relcache during redo. We'd have
to add a flag to every vacuum record indicating whether it affected
catalogs, which is not fun, and vacuum might not always even know. And
the standby would still need a way to keep track of the oldest valid
catalog_xmin across restart without the ability to write it to
checkpoints.

It's a lot simpler and cheaper to have the master do it.

* Add a new recovery conflict type for conflict with catalog_xmin. Abort
in-progress logical decoding sessions with conflict with recovery where needed
catalog_xmin is too old

Are we retaining WAL for slots broken in that way?

Yes, until the slot is dropped.

If I added a persistent flag on the slot to indicate that the slot is
invalid, then we could ignore it for purposes of WAL retention. It
seemed unnecessary at this stage.

* Make extra efforts to reserve master's catalog_xmin during decoding startup
on standby.

What does that mean?

WaitForMasterCatalogXminReservation(...)

I don't like it. At all. I'd rather have hot standby feedback replies
so we can know for sure when the master has locked in our feedback.
It's my most disliked part of this patch.

* Remove checks preventing starting logical decoding on standby

To me that's too many different things in one commit. A bunch of them
seem like it'd be good if they'd get independent buildfarm cycles too.

I agree with you. I had them all separate before and was told that
there were too many patches. I also had fixes that spanned multiple
patches and were difficult to split up effectively.

I'd like to split it roughly along the lines of the bulleted items,
but I don't want to do it only to have someone else tell me to just
squash it again and waste all the work (again). I'll take the risk I
guess.

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..36bbb98 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
if (!state->rs_logical_rewrite)
return;
-     ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+     /* Use the catalog_xmin being retained by vacuum */
+     ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);

What does that comment mean? Vacuum isn't the only thing that prunes old
records.

I mean to refer to ShmemVariableCache.oldestCatalogXmin, the effective
catalog xmin used for record removal, not
ProcArray.replication_slot_catalog_xmin, the pending catalog_xmin for
local slots.

i.e. use the catalog_xmin that we've recorded in WAL and promised to standbys.

I agree the comment is unclear. Not sure how to improve it without
making it overly long though.

+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+     Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));

Uh, that's long-ish. And doesn't agree with the comment above
(s/startup process/process performing recovery/?).

This is a long enough list that I'd consider just dropping the assert.

Fair enough.

+     else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+     {
+             xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+             /*
+              * Unless logical decoding is possible on this node, we don't care about
+              * this record.
+              */
+             if (!XLogLogicalInfoActive() || max_replication_slots == 0)
+                     return;

Too many negatives for my taste, but whatever.

Also removed in the latest version, since it turns out not to be accurate.

I had made the incorrect assumption that our global catalog_xmin was
necessarily 0 when wal_level < logical. But this is not the case, per
the new TAP tests in latest patch. We can have logical slots from when
wal_level was logical still existing with a valid catalog_xmin after
we restart into wal_level = replica.

+             /*
+              * Apply the new catalog_xmin limit immediately. New decoding sessions
+              * will refuse to start if their slot is past it, and old ones will
+              * notice when we signal them with a recovery conflict. There's no
+              * effect on the catalogs themselves yet, so it's safe for backends
+              * with older catalog_xmins to still exist.
+              *
+              * We don't have to take ProcArrayLock since only the startup process
+              * is allowed to change oldestCatalogXmin when we're in recovery.
+              */
+             SetOldestCatalogXmin(xlrec->new_catalog_xmin);

Which seems to rely on ResolveRecoveryConflictWithLogicalDecoding's
lwlock acquisition for barriers?

I don't yet have a really solid grasp of memory ordering and barrier
issues in multiprocessing. As I understand it, processes created after
this point aren't going to see the old value, they'll fork() with a
current snapshot of memory, so either they'll see the new value or
they'll be captured by our
ResolveRecoveryConflictWithLogicalDecoding() run (assuming they don't
exit first).

New decoding sessions for existing backends would be an issue. They
call EnsureActiveLogicalSlotValid() which performs a volatile read on
ShmemVariableCache->oldestCatalogXmin . But that isn't sufficient, is
it? We need a write barrier in SetOldestCatalogXmin and a read barrier
in EnsureActiveLogicalSlotValid.

I'll fix that. Thanks very much.
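
Roughly what I have in mind, as a standalone sketch rather than patch code. It uses C11 release/acquire atomics as a stand-in for PostgreSQL's own barrier primitives (pg_write_barrier() and pg_read_barrier()); the function and variable names are hypothetical, and it assumes both XIDs are valid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Stand-in for ShmemVariableCache->oldestCatalogXmin. */
static _Atomic TransactionId oldest_catalog_xmin_sketch;

/*
 * Writer side (startup process in the real code): the release store makes
 * everything written before it visible to a reader that observes the new
 * value with an acquire load.
 */
static void
set_oldest_catalog_xmin_sketch(TransactionId new_limit)
{
    assert(new_limit != 0);
    atomic_store_explicit(&oldest_catalog_xmin_sketch, new_limit,
                          memory_order_release);
}

/*
 * Reader side (a backend about to start decoding): the acquire load pairs
 * with the release store above. The slot is usable only if its catalog_xmin
 * has not fallen behind the limit.
 */
static bool
slot_is_still_valid_sketch(TransactionId slot_catalog_xmin)
{
    TransactionId limit;

    assert(slot_catalog_xmin != 0);
    limit = atomic_load_explicit(&oldest_catalog_xmin_sketch,
                                 memory_order_acquire);

    /* Simplified circular comparison: slot_catalog_xmin >= limit. */
    return (int32_t) (slot_catalog_xmin - limit) >= 0;
}
```

A plain volatile read only stops the compiler from caching the value; it does not order the load against other memory accesses, which is why the paired barriers are needed.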

+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+     XLogRecPtr ptr = InvalidXLogRecPtr;
+
+     if (XLogInsertAllowed())
+     {
+             xl_xact_catalog_xmin_advance xlrec;
+
+             xlrec.new_catalog_xmin = new_catalog_xmin;
+
+             XLogBeginInsert();
+             XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+
+             ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+     }

Huh, why is this test needed and ok?

Good point. It isn't anymore.

I previously had catalog_xmin advances on replicas running a similar
path and skipping xlog. But that was fragile. So now
UpdateOldestCatalogXmin() is only called from the master, per the
assertion at the start, so it's unnecessary to test for
XLogInsertAllowed( ) here.

Fixed.

@@ -9449,6 +9456,16 @@ XLogReportParameters(void)
XLogFlush(recptr);
}

+             /*
+              * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
+              * require oldestCatalogXmin in checkpoints and it no longer
+              * makes sense, so update shmem and xlog the change. This will
+              * get written out in the next checkpoint.
+              */
+             if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
+                     wal_level < WAL_LEVEL_LOGICAL)
+                     UpdateOldestCatalogXmin(true);

What if we crash before this happens?

We run XLogReportParameters before we set ControlFile->state =
DB_IN_PRODUCTION, so we'd re-run recovery and call it again next time
through.

But as it turns out the above is neither necessary nor correct anyway,
it relies on the invalid assumption that catalog_xmin is 0 when
wal_level is < logical. Per above, not the case, so we can't
short-circuit catalog_xmin logging tests when wal_level = replica.

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ff633fa..2d16bf0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId safeMxactLimit;
/*
+      * When logical decoding is enabled, we must write any advance of
+      * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+      * This ensures that any standbys doing logical decoding can cancel
+      * decoding sessions and invalidate slots if we remove tuples they
+      * still need.
+      */
+     UpdateOldestCatalogXmin(false);

I'm on a first read-through of this, but it appears you don't do
anything similar in heap_page_prune()? And we can't just start emitting
loads of additional records there, because it's called much more often...

vacuum_set_xid_limits sets OldestXmin in lazy_vacuum_rel, which is the
OldestXmin passed to heap_page_prune.

I could Assert that heap_page_prune's OldestXmin PrecedesOrEquals
ShmemVariableCache->oldestCatalogXmin I guess. It seemed unnecessary.

+             /*
+              * The walreceiver should be running when we try to create a slot. If
+              * we're unlucky enough to catch the walreceiver just as it's
+              * restarting after an error, well, the client can just retry. We don't
+              * bother to sleep and re-check.
+              */
+             if (!walrcv_running)
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("streaming replication is not active"),
+                                      errhint("Logical decoding on standby requires that streaming replication be configured and active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs.")));

That seems quite problematic. What if there's a momentary connection
failure? This also has the issue that just because you checked
walrcv_running at some point, that doesn't guarantee anything by the time
you actually check. Seems like life would be easier if recovery.conf were
guc-ified already - checking for primary_conninfo/primary_slot_name etc
wouldn't have that issue (and can't be changed while running).

Yes, I very much wish walreceiver were already GUC-ified. I'd rather
test primary_conninfo and primary_slot_name .

Usage of a slot doesn't actually guarantee much in cascaded setups, does
it?

It doesn't entirely eliminate the potential for a race with catalog
removal, but neither does hot_standby_feedback on a non-cascading
setup. Right now we tolerate that race and the risk that the slot may
become invalid. The application can prevent that by making sure it has
a slot on the master and the standby has caught up past the master's
lsn at the time of that slot's creation before it creates a slot on
the standby.

That's part of the reason for the hoop jumping around catalog_xmin
advance: to make sure we know, for sure, if it's safe to decode from a
slot given that we haven't been able to directly enforce our xmin on
the master.

To get rid of that race without application intervention, we need the
ability for a feedback message to flow up the cascade, and a reply
that specifically matches that feedback message (or at least that
individual downstream node) to flow down the cascade.

I'm working on just that, but there's no way it'll be ready for pg10
obviously, and it has some difficult issues. It's actually intended to
help prevent conflict with standby cancels shortly after hot_standby
starts up, but it'll help with slot creation too.

Even with all that, we'll still need some kind of xlog'd catalog_xmin
knowledge, because users can do silly things like drop the slot
connecting standby to master and re-create it, causing the standby's
needed catalog_xmin on the master to become un-pinned. We don't want
to risk messily crashing if that happens.

@@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
* xmin horizons by other backends, get the safe decoding xid, and inform
* the slot machinery about the new limit. Once that's done the
* ProcArrayLock can be released as the slot machinery now is
-      * protecting against vacuum.
+      * protecting against vacuum - if we're on the master. If we're running on
+      * a replica, we have to wait until hot_standby_feedback locks in our
+      * needed catalogs, per details on WaitForMasterCatalogXminReservation().
* ----
*/
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,

ReplicationSlotsComputeRequiredXmin(true);

+     if (RecoveryInProgress())
+             WaitForMasterCatalogXminReservation(slot);
+
+     Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
+                                                                              slot->data.catalog_xmin));
+
LWLockRelease(ProcArrayLock);

I think it's quite a bad idea to do a blocking operation like
WaitForMasterCatalogXminReservation while holding ProcArrayLock.

+/*
+ * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
+ * if needed. Caller must hold exclusive ProcArrayLock, which this function will
+ * temporarily release while sleeping but will re-acquire.

Ah. I see. Hm :(.

Exactly.

I'm increasingly inclined to rip that out and make preventing races
with master catalog removal the application's problem. Create a slot
on the master first, or accept that you may have to retry.

+ * When we're creating a slot on a standby we can't directly set the
+ * master's catalog_xmin; the catalog_xmin is set locally, then relayed
+ * over hot_standby_feedback. The master may remove the catalogs we
+ * asked to reserve between when we set a local catalog_xmin and when
+ * hs feedback makes that take effect on the master. We need a feedback
+ * reply mechanism here, where:
+ *
+ * - we tentatively reserve catalog_xmin locally

Will that already trigger recovery conflicts?

If we already have local active slots, we'll be using their existing
catalog_xmin and there's no issue.

If we don't already have local slots the only conflict potential is
this backend. It could potentially cause a conflict if we replayed a
greatly advanced catalog_xmin from the master before we got the chance
to advance our local one accordingly.

+ * - we wake the walreceiver by setting its latch
+ * - walreceiver sends hs_feedback
+ * - upstream walsender sends a new 'hs_feedback reply' message with
+ *   actual (xmin, catalog_xmin) reserved.
+ * - walreceiver sees reply and updates ShmemVariableCache or some other
+ *   handy bit of shmem with hs feedback reservations from reply

"or some other handy bit"?

Ha. Will fix.

+ * - we poll the reservations while we wait
+ * - we set our catalog_xmin to that value, which might be later if
+ *   we missed our requested reservation, or might be earlier if
+ *   someone else is holding down catalog_xmin on master. We got a hs
+ *   feedback reply so we know it's reserved.
+ *
+ * For cascading, the actual reservation will need to cascade up the
+ * chain by walsender setting its own walreceiver's latch in turn, etc.
+ *
+ * For now, we just set the local slot catalog_xmin and sleep until
+ * oldestCatalogXmin equals or passes our reservation. This is fine if we're
+ * the only decoding session, but it is vulnerable to races if slots on the
+ * master or other decoding sessions on other standbys connected to the same
+ * master exist. They might advance their reservation before our hs_feedback
+ * locks it down, allowing vacuum to remove tuples we need. So we might start
+ * decoding on our slot then error with a conflict with recovery when we see
+ * catalog_xmin advance.
+ */

I was about to list some of these issues. That's a bit unsatisfying.

I concur. I just don't have a better answer.

I think I'd like to rip it out and make it the application's problem
until we can do it right.

Pondering this for a bit, but I'm ~9h into a flight, so maybe not
tonight^Wthis morning^Wwhaaaa.

+static void
+WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
+{

This comment seems to duplicate some of the function header
comment. Such duplication usually leads to either or both getting out of
date rather quickly.

Not commenting line-by-line on the code here, but I'm extremely doubtful
that this approach is stable enough, and that the effect of holding
ProcArrayLock exclusively over prolonged amounts of time is acceptable.

+ ReplicationSlotsComputeRequiredXmin(true);

Why do we need this? The caller does it too, no?

Because we force a walsender update immediately and want a current value.

It's cheap enough for running in slot creation.

+     /* Tell the master what catalog_xmin we settled on */
+     WalRcvForceReply();
+
+     /* Reset ps display if we changed it */
+     if (new_status)
+     {
+             set_ps_display(new_status, false);
+             pfree(new_status);
+     }

We really shouldn't do stuff like this while holding ProcArrayLock.

Yeah, good point.

+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid()
+{

Missing (void).

Augh, C++ still has its tentacles in my brain.

+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the database
+ * to ensure no replication slots on the database are in use.

Stuff like this really should be its own commit. It can trivially be
tested on its own, is useful on its own (just have DROP DATABASE do it),

Agreed, will do.

+ * If we fail here we'll leave the in-memory state of replication slots
+ * inconsistent with its on-disk state, so we need to PANIC.

We worked quite hard to make it extremely unlikely for that to happen in
practice. I also don't see why there should be any new PANICs in this
code.

I didn't figure out a sensible way not to. I'll revisit that.

+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+     int                     i;
+
+     if (max_replication_slots <= 0)
+             return;
+
+     /*
+      * We only need a shared lock here even though we activate slots,
+      * because we have an exclusive lock on the database we're dropping
+      * slots on and don't touch other databases' slots.
+      */
+     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

Hm? Acquiring a slot always only takes a shared lock, no?

I don't really see how "database is locked" guarantees enough for your
logic - it's already possible to drop slots from other databases, and
dropping a slot acquires it temporarily?

You can drop slots from other DBs.

Ugh. Right. That's a frustrating oversight. I'll have to revisit that logic.

+     for (i = 0; i < max_replication_slots; i++)
+     {
+             ReplicationSlot *s;
+             NameData slotname;
+             int active_pid;
+
+             s = &ReplicationSlotCtl->replication_slots[i];
+
+             /* cannot change while ReplicationSlotCtlLock is held */
+             if (!s->in_use)
+                     continue;
+
+             /* only logical slots are database specific, skip */
+             if (!SlotIsLogical(s))
+                     continue;
+
+             /* not our database, skip */
+             if (s->data.database != dboid)
+                     continue;
+
+             /* Claim the slot, as if ReplicationSlotAcquire()ing */
+             SpinLockAcquire(&s->mutex);
+             strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+             NameStr(slotname)[NAMEDATALEN-1] = '\0';
+             active_pid = s->active_pid;
+             if (active_pid == 0)
+             {
+                     MyReplicationSlot = s;
+                     s->active_pid = MyProcPid;
+             }
+             SpinLockRelease(&s->mutex);
+
+             /*
+              * The caller should have an exclusive lock on the database so
+              * we'll never have any in-use slots, but just in case...
+              */
+             if (active_pid)
+                     elog(PANIC, "replication slot %s is in use by pid %d",
+                              NameStr(slotname), active_pid);

So, yea, this doesn't seem ok. Why don't we just ERROR out, instead of
PANICing? There seems to be absolutely no correctness reason for a PANIC
here?

We've acquired the slot but it's active by another backend. Something
broke. But you're right, PANIC is an over-reaction.

+             /*
+              * To avoid largely duplicating ReplicationSlotDropAcquired() or
+              * complicating it with already_locked flags for ProcArrayLock,
+              * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+              * just release our ReplicationSlotControlLock to drop the slot.
+              *
+              * There's no race here: we acquired this slot, and no slot "behind"
+              * our scan can be created or become active with our target dboid due
+              * to our exclusive lock on the DB.
+              */
+             LWLockRelease(ReplicationSlotControlLock);
+             ReplicationSlotDropAcquired();
+             LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

I don't see much problem with this, but I'd change the code so you
simply do a goto restart; if you released the slot. Then there's a lot
less chance / complications around temporarily releasing
ReplicationSlotControlLock.

Good idea.
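The restart pattern suggested above can be sketched in isolation. This is a standalone illustration, not PostgreSQL code: DemoSlot, demo_lock(), demo_unlock() and demo_drop_slot() are hypothetical stand-ins for the slot array, ReplicationSlotControlLock and ReplicationSlotDropAcquired().

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 8

/* Hypothetical stand-in for a replication slot; not the real struct. */
typedef struct DemoSlot
{
	bool		in_use;
	bool		is_logical;
	unsigned	database;
} DemoSlot;

static DemoSlot demo_slots[MAX_SLOTS];
static int	demo_lock_depth = 0;	/* stand-in for the control lock */

static void demo_lock(void)   { assert(demo_lock_depth == 0); demo_lock_depth++; }
static void demo_unlock(void) { assert(demo_lock_depth == 1); demo_lock_depth--; }

/* Dropping must run without the control lock held, as in the patch. */
static void
demo_drop_slot(DemoSlot *s)
{
	assert(demo_lock_depth == 0);
	s->in_use = false;
}

/* Drop every logical slot for dboid; returns how many were dropped. */
int
demo_drop_db_slots(unsigned dboid)
{
	int			dropped = 0;
	int			i;

restart:
	demo_lock();
	for (i = 0; i < MAX_SLOTS; i++)
	{
		DemoSlot   *s = &demo_slots[i];

		if (!s->in_use || !s->is_logical || s->database != dboid)
			continue;

		/* Release the lock, drop the slot, then rescan from the top. */
		demo_unlock();
		demo_drop_slot(s);
		dropped++;
		goto restart;
	}
	demo_unlock();
	return dropped;
}
```

Each pass either drops one slot or finishes the scan, so the loop terminates, and the lock is never held across the drop. Nothing depends on the array staying unchanged while the lock is released, because the scan starts over.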

+                                              *
+                                              * If logical decoding information is enabled, we also
+                                              * send immediate hot standby feedback so as to reduce
+                                              * the delay before our needed catalogs are locked in.

"logical decoding information ... enabled"

XLogLogicalInfoActive()

and "catalogs are locked in"

Yeah, fair.

are a bit too imprecise descriptions for my taste.

Will adjust.

+             xmin = GetOldestXmin(NULL,
+                                                      false, /* don't ignore vacuum */
+                                                      true /* ignore catalog xmin */);
+
+             /*
+              * The catalog_Xmin reported by GetOldestXmin is the effective
+              * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
+              * records from the master. Sending it back to the master would be
+              * circular and prevent its catalog_xmin ever advancing once set.
+              * We should only send the catalog_xmin we actually need for slots.
+              */
+             ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);

Given that you don't have catalog_xmin set by GetOldestXmin that comment
is a bit misleading.

It is. Too many revisions with too much time between them. Fixing.

@@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
NormalTransactionIdPrecedes(replication_slot_xmin, result))
result = replication_slot_xmin;

+     if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
+     {
+             /*
+              * After locks have been released and defer_cleanup_age has been applied,
+              * check whether we need to back up further to make logical decoding
+              * safe. We need to do so if we're computing the global limit (rel =
+              * NULL) or if the passed relation is a catalog relation of some kind.
+              */
+             if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
+                     NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+                     result = replication_slot_catalog_xmin;
+     }

The nesting of these checks, and the comments about them, is a bit
weird.

Agree. I didn't find it readable in one test, and it wasn't clear how
to comment on just the inner part of the tests without splitting it
up. But it's easy enough to merge, I just found it less readable and
harder to understand.

+/*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+     return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+                     || (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}

Your lines are really long - pgindent (which you really should run) will
mangle this. I think it'd be better to rephrase this.

Thanks. Will.

IIRC pgindent created a LOT of unrelated noise at the time I was
working on it, but I'll recheck.
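One possible rephrasing of CatalogXminNeedsUpdate() that keeps lines short, shown as a standalone sketch. The TransactionId macros and TransactionIdPrecedes() below are simplified local copies included only so the example compiles on its own; the logic is meant to match the quoted one-liner, not to be the revised patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)
#define TransactionIdIsValid(xid)	((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid)	((xid) >= FirstNormalTransactionId)

/* Simplified local copy of the wraparound-aware xid comparison. */
static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
	if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
		return id1 < id2;
	return (int32_t) (id1 - id2) < 0;
}

static bool
CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin,
					   TransactionId slots_catalog_xmin)
{
	/* Newly set or newly unset: validity differs between the two. */
	if (TransactionIdIsValid(vacuum_catalog_xmin) !=
		TransactionIdIsValid(slots_catalog_xmin))
		return true;

	/* The slots' requirement has advanced past what vacuum uses. */
	return TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin);
}
```

Splitting the two conditions gives each one its own comment, which is harder to do inside a single boolean expression.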

+/*
+ * If necessary, copy the current catalog_xmin needed by repliation slots to

Typo: repliation

Thanks.

+ * the effective catalog_xmin used for dead tuple removal.
+ *
+ * When logical decoding is enabled we write a WAL record before advancing the
+ * effective value so that standbys find out if catalog tuples they still need
+ * get removed, and can properly cancel decoding sessions and invalidate slots.
+ *
+ * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
+ * and need to clear the shmem state, since we want to bypass the wal_level
+ * check and force xlog writing.
+ */
+void
+UpdateOldestCatalogXmin(bool force)

I'm a bit confused by this function and variable name. What does

+       TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+                                                                         * is guaranteed to still exist */

mean? I complained about the overall justification in the commit
already, but looking at this commit alone, the justification for this
part of the change is quite hard to understand.

Standbys have no way to know what catalog row versions are guaranteed to exist.

They know, from vacuum xlog records, when we remove row versions,
index entries, etc associated with a transaction. But the standby has
no way to know if the affected relation is a catalog or not, it only
knows the relfilenode. So it can't maintain a local notion of
"effective global catalog_xmin on the master as of the last xlog
record I replayed".

I could add is_catalog flags to all the vacuum xlog records via a
secondary struct that's only added when wal_level = logical, but that
seems pretty awful and likely to be very noisy. It also wouldn't help
the standby know, at startup, what the current catalog_xmin of the
master is since it won't be in a checkpoint or the control file.

+{
+     TransactionId vacuum_catalog_xmin;
+     TransactionId slots_catalog_xmin;
+
+     /*
+      * If we're not recording logical decoding information, catalog_xmin
+      * must be unset and we don't need to do any work here.

If we don't need to do any work, shouldn't we return early?

Yes.

+     if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
+     {
+             XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+             LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+             /*
+              * A concurrent updater could've changed these values so we need to re-check
+              * under ProcArrayLock before updating.
+              */
+             vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
+             slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);

why are there volatile reads here?

Because I didn't understand volatile well enough. It's not a memory
barrier and provides no guarantee that we're seeing recent values.

It should probably just take ProcArrayLock.
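A minimal standalone sketch of the "just take the lock" approach, using a pthread mutex as a hypothetical stand-in for ProcArrayLock. The demo_* names and the snapshot struct are assumptions for illustration and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-ins for ProcArrayLock and the two shared fields. */
static pthread_mutex_t demo_proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
static TransactionId demo_oldest_catalog_xmin = 0;
static TransactionId demo_slot_catalog_xmin = 0;

typedef struct CatalogXminSnapshot
{
	TransactionId vacuum_catalog_xmin;
	TransactionId slots_catalog_xmin;
} CatalogXminSnapshot;

/* Read both values as one consistent pair; no volatile casts needed. */
CatalogXminSnapshot
demo_read_catalog_xmins(void)
{
	CatalogXminSnapshot snap;

	pthread_mutex_lock(&demo_proc_array_lock);
	snap.vacuum_catalog_xmin = demo_oldest_catalog_xmin;
	snap.slots_catalog_xmin = demo_slot_catalog_xmin;
	pthread_mutex_unlock(&demo_proc_array_lock);
	return snap;
}

/* Writers take the same lock, so readers never see half an update. */
void
demo_set_catalog_xmins(TransactionId vacuum_xmin, TransactionId slots_xmin)
{
	pthread_mutex_lock(&demo_proc_array_lock);
	demo_oldest_catalog_xmin = vacuum_xmin;
	demo_slot_catalog_xmin = slots_xmin;
	pthread_mutex_unlock(&demo_proc_array_lock);
}
```

The lock provides both the ordering and the atomicity of the pair. A volatile cast only stops the compiler from caching the load and gives neither.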

+             if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+                     SetOldestCatalogXmin(slots_catalog_xmin);

Why don't we check force here, but above?

Good point.

I've removed force anyway, in the latest revision. Same reason as
given above re the StartupXLOG parameters check stuff.

@@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
oldestSafeXid = ShmemVariableCache->nextXid;

/*
-      * If there's already a slot pegging the xmin horizon, we can start with
-      * that value, it's guaranteed to be safe since it's computed by this
-      * routine initially and has been enforced since.
+      * If there's already an effectiveCatalogXmin held down by vacuum
+      * it's definitely safe to start there, and it can't advance
+      * while we hold ProcArrayLock.

What does "held down by vacuum" mean?

Brain fart. Held down by an existing slot. Comment also needs rewording.

/*
+ * Notify a logical decoding session that it conflicts with a
+ * newly set catalog_xmin from the master.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+     ProcArrayStruct *arrayP = procArray;
+     int                     index;
+
+     /*
+      * We have to scan ProcArray to find the process and set a pending recovery
+      * conflict even though we know the pid. At least we can get the BackendId
+      * and void a ProcSignal scan later.
+      *
+      * The pid might've gone away, in which case we got the desired
+      * outcome anyway.
+      */
+     LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+     for (index = 0; index < arrayP->numProcs; index++)
+     {
+             int                     pgprocno = arrayP->pgprocnos[index];
+             volatile PGPROC *proc = &allProcs[pgprocno];
+
+             if (proc->pid == session_pid)
+             {
+                     VirtualTransactionId procvxid;
+
+                     GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+                     proc->recoveryConflictPending = true;
+
+                     /*
+                      * Kill the pid if it's still here. If not, that's what we
+                      * wanted so ignore any errors.
+                      */
+                     (void) SendProcSignal(session_pid,
+                             PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
+
+                     break;
+             }
+     }
+
+     LWLockRelease(ProcArrayLock);

Doesn't seem ok to do this while holding ProcArrayLock.

Fair enough. And I guess it's safe enough to take and release it,
since new processes that start won't be at risk of cancellation so we
don't care about whether or not we scan them.

+/*
+ * Scan to see if any clients are using replication slots that are below the
+ * new catalog_xmin theshold and sigal them to terminate with a recovery
+ * conflict.
+ *
+ * We already applied the new catalog_xmin record and updated the shmem
+ * catalog_xmin state, so new clients that try to use a replication slot
+ * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
+ * don't have to guard against them here.
+ *
+ * Replay can only continue safely once every slot that needs the catalogs
+ * we're going to free for removal is gone. So if any conflicting sessions
+ * exist, wait for any standby conflict grace period then signal them to exit.
+ *
+ * The master might clear its reserved catalog_xmin if all upstream slots are
+ * removed or clear their feedback reservations, sending us
+ * InvalidTransactionId. If we're concurrently trying to create a new slot and
+ * reserve catalogs the InvalidXid reservation report might come in while we
+ * have a slot waiting for hs_feedback confirmation of its reservation. That
+ * would cause the waiting process to get canceled with a conflict with
+ * recovery here since its tentative reservation conflicts with the master's
+ * report of 'nothing reserved'. To allow it to continue to seek a startpoint
+ * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
+ * still looking for where to start. We'll sometimes notice a conflict but the
+ * slot will advance its catalog_xmin to a more recent nextXid and cease to
+ * conflict when we re-check. (The alternative is to track slots being created
+ * differently to slots actively decoding in shmem, which seems unnecessary. Or
+ * to separate the 'tentative catalog_xmin reservation' of a slot from its
+ * actual needed catalog_xmin.)
+ *
+ * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
+ * walsender-based logical decoding sessions won't have any virtualxid for much
+ * of their life and the end of their virtualxids doesn't mean the end of a
+ * potential conflict. It would also cancel too aggressively, since it cares
+ * about the backend's xmin and logical decoding only needs the catalog_xmin.
+ */

The use of "we" seems confusing here, because it's not the same process.

Generally I think your comments need to be edited a bit for brevity and
preciseness.

Will work on it.

Me, verbose? Really?

+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+     int i;
+
+     if (!InHotStandby)
+             /* nobody can be actively using logical slots */
+             return;
+
+     /* Already applied new limit, can't have replayed later one yet */
+     Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+     /*
+      * Find the first conflicting active slot and wait for it to be free,
+      * signalling it if necessary, then repeat until there are no more
+      * conflicts.
+      */
+     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+     for (i = 0; i < max_replication_slots; i++)
+     {

I'm pretty strongly against any code outside of slot.c doing this.

IIRC I originally tried to do that as part of slot.c but found that it
resulted in other ugliness relating to access to other structures. But
I can't remember what anymore, so I'll revisit it.

@@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));

/*
-              * All conflicts apart from database cause dynamic errors where the
-              * command or transaction can be retried at a later point with some
-              * potential for success. No need to reset this, since non-retryable
-              * conflict errors are currently FATAL.
+              * All conflicts apart from database and catalog_xmin cause dynamic
+              * errors where the command or transaction can be retried at a later
+              * point with some potential for success. No need to reset this, since
+              * non-retryable conflict errors are currently FATAL.
*/
-             if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+             if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+                     reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
RecoveryConflictRetryable = false;
}

Hm. Why is this a non-retryable error?

The global catalog_xmin isn't going to go backwards, so if the slot
needs a given catalog_xmin and we want to discard it....

... then we should give it a while to catch up. Right. It should be retryable.

Ok, landing soon. Gotta finish here.

Greatly appreciated, I know it's not the nicest to review.

0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends. I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

I'll be doing that, yes.

I really want some way to create slots on replicas, advance them to
follow the master's position, and have them able to be used after
promotion to master.

I don't think actually live decoding on replica is ready yet, though
I'd find the ability to shift decoding workloads to replicas rather
nice when it is ready.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#45)
Re: Logical decoding on standby

On 21 March 2017 at 02:21, Craig Ringer <craig@2ndquadrant.com> wrote:

On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

Subject: [PATCH 2/3] Follow timeline switches in logical decoding

FWIW, the title doesn't really seem accurate to me.

Yeah, it's not really at the logical decoding layer at all.

"Teach xlogreader to follow timeline switches" ?

Happy with that. I think Craig has addressed Andres' issues with this
patch, so I will apply later today as planned using that name.

The longer Logical Decoding on Standby patch will not be applied yet,
and not without further changes, per review.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#47Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#44)
Re: Logical decoding on standby

On 21 March 2017 at 09:05, Craig Ringer <craig@2ndquadrant.com> wrote:

Thanks, that's a helpful point. The commit in question is 978b2f65. I
didn't notice that it introduced XLogReader use in twophase.c, though
I should've realised given the discussion about fetching recent 2pc
info from xlog. I don't see any potential for issues at first glance,
but I'll go over it in more detail. The main concern is validity of
ThisTimeLineID, but since it doesn't run in recovery I don't see much
of a problem there. That also means it can afford to use the current
timeline-oblivious read_local_xlog_page safely.

TAP tests for 2pc were added by 3082098. I'll check to make sure they
have appropriate coverage for this.

The TAP tests pass fine, and I can't see any likely issues either.

XLogReader for 2PC doesn't happen on standby, and RecoveryInProgress()
will update ThisTimeLineID on promotion.

Did you check whether ThisTimeLineID is actually always valid in the
processes logical decoding could run in? IIRC it's not consistently
updated during recovery in any process but the startup process.

I share your concerns that it may not be well enough maintained.
Thank you for the reminder, that's been on my TODO and got lost when I
had to task-hop to other priorities.

The main place we maintain ThisTimeLineID (outside StartupXLOG of
course) is in walsender's GetStandbyFlushRecPtr, which calls
GetWalRcvWriteRecPtr. That's not used in walsender's logical decoding
or in the SQL interface.

I've changed the order of operations in read_local_xlog_page to ensure
that RecoveryInProgress() updates ThisTimeLineID if we're promoted,
and made it update ThisTimeLineID from GetXLogReplayRecPtr otherwise.

pg_logical_slot_get_changes_guts was fine already.

Because xlog read callbacks must not attempt to read pages past the
flush limit (master) or replay limit (standby), it doesn't matter if
ThisTimeLineID is completely up to date, only that it's valid as-of
that LSN.

I did identify one problem. The startup process renames the last
segment in a timeline to .partial when it processes a timeline switch.
See xlog.c:7597. So if we have the following order of operations:

* Update ThisTimeLineID to 2 at latest redo ptr
* XLogReadDetermineTimeline chooses timeline 2 to read from
* startup process replays timeline switch to TL 3 and renames last
segment in old timeline to .partial
* XLogRead() tries to open segment with TL 2

we'll fail. I don't think it matters much though. We're not actually
supporting streaming decoding from standby this release by the looks,
and even if we did the effect would be limited to an ERROR and a
reconnect. It doesn't look like there's really any sort of lock or
other synchronisation we can rely on to prevent this, and we should
probably just live with it. If we have already opened the segment
we'll just keep reading from it without noticing the rename; if we
haven't and are switching to it just as it's renamed we'll ERROR when
we try to open it.
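To make the timeline choice in the sequence above concrete, here is a standalone sketch of picking which timeline's segment file to open for a page. It is an illustration under assumptions, not the patch's code: DemoTimelineEntry is a simplified array-based history, DEMO_SEG_SIZE assumes 16MB segments, and both demo_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

#define InvalidXLogRecPtr	((XLogRecPtr) 0)
#define DEMO_SEG_SIZE		((XLogRecPtr) 16 * 1024 * 1024)

/* Simplified history entry: the timeline covers [begin, end). */
typedef struct DemoTimelineEntry
{
	TimeLineID	tli;
	XLogRecPtr	begin;
	XLogRecPtr	end;		/* InvalidXLogRecPtr means still current */
} DemoTimelineEntry;

/* Return the timeline that owns ptr, or 0 if no entry covers it. */
TimeLineID
demo_tli_of_point(const DemoTimelineEntry *history, size_t n, XLogRecPtr ptr)
{
	size_t		i;

	for (i = 0; i < n; i++)
	{
		const DemoTimelineEntry *e = &history[i];

		if (ptr >= e->begin &&
			(e->end == InvalidXLogRecPtr || ptr < e->end))
			return e->tli;
	}
	return 0;
}

/*
 * Pick the timeline whose segment file to open for the page at ptr:
 * the one owning the last byte of the containing segment. A segment
 * holding a switch point is therefore read from the newer timeline,
 * whose copy of it is not the one that gets renamed to .partial.
 */
TimeLineID
demo_tli_for_segment(const DemoTimelineEntry *history, size_t n,
					 XLogRecPtr ptr)
{
	XLogRecPtr	seg_end = (ptr / DEMO_SEG_SIZE + 1) * DEMO_SEG_SIZE - 1;

	return demo_tli_of_point(history, n, seg_end);
}
```

The race described above remains even with this rule: the history consulted here can be stale by the time the file is opened, so the caller still has to be prepared for the open to fail.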

I had cascading and promotion tests in progress for decoding on
standby, but doubt there's much point finishing them off now that it's
not likely that decoding on standby can be added for this CF.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#48Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#47)
1 attachment(s)
Re: Logical decoding on standby

Hi all

Updated timeline following patch attached.

There's a change in read_local_xlog_page to ensure we maintain
ThisTimeLineID properly, otherwise it's just comment changes.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Teach-xlogreader-to-follow-timeline-switches.patch (text/x-patch; charset=US-ASCII)
From d42ceaec47793f67c55523d1aeb72be61c4f2dea Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH] Teach xlogreader to follow timeline switches

The XLogReader was timeline-agnostic and assumed that all WAL segments
requested would be on ThisTimeLineID.

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created or advanced on a replica so this is
mostly foundation work for subsequent changes to enable logical decoding on
standbys.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with logical decoding on a standby.

Note that an earlier version of logical decoding timeline following
was committed to 9.6 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.6
feature freeze when issues were discovered too late to safely fix them
in the 9.6 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.
---
 src/backend/access/transam/xlogutils.c             | 213 +++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c     |   8 +-
 src/backend/replication/walsender.c                |  11 +-
 src/include/access/xlogreader.h                    |  16 ++
 src/include/access/xlogutils.h                     |   3 +
 src/test/recovery/Makefile                         |   2 +
 .../recovery/t/009_logical_decoding_timelines.pl   | 130 +++++++++++++
 7 files changed, 364 insertions(+), 19 deletions(-)
 create mode 100644 src/test/recovery/t/009_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b2b9fcb..28c07d3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,6 +19,7 @@
 
 #include <unistd.h>
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -662,6 +663,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
+	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -677,7 +679,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
+			sendTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
@@ -704,6 +707,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -754,6 +758,133 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
+ * Determine which timeline to read an xlog page from and set the
+ * XLogReaderState's currTLI to that timeline ID.
+ *
+ * We care about timelines in xlogreader when we might be reading xlog
+ * generated prior to a promotion, either if we're currently a standby in
+ * recovery or if we're a promoted master reading xlogs generated by the old
+ * master before our promotion.
+ *
+ * wantPage must be set to the start address of the page to read and
+ * wantLength to the amount of the page that will be read, up to
+ * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
+ *
+ * We switch to an xlog segment from the new timeline eagerly when on a
+ * historical timeline, as soon as we reach the start of the xlog segment
+ * containing the timeline switch.  The server copied the segment to the new
+ * timeline so all the data up to the switch point is the same, but there's no
+ * guarantee the old segment will still exist. It may have been deleted or
+ * renamed with a .partial suffix so we can't necessarily keep reading from
+ * the old TLI even though tliSwitchPoint says it's OK.
+ *
+ * We can't just check the timeline when we read a page on a different segment
+ * to the last page. We could've received a timeline switch from a cascading
+ * upstream, so the current segment ends apruptly (possibly getting renamed to
+ * .partial) and we have to switch to a new one.  Even in the middle of reading
+ * a page we could have to dump the cached page and switch to a new TLI.
+ *
+ * Because of this, callers MAY NOT assume that currTLI is the timeline that
+ * will be in a page's xlp_tli; the page may begin on an older timeline or we
+ * might be reading from historical timeline data on a segment that's been
+ * copied to a new timeline.
+ *
+ * The caller must also make sure it doesn't read past the current replay
+ * position (using GetWalRcvWriteRecPtr) if executing in recovery, so it
+ * doesn't fail to notice that the current timeline became historical. The
+ * caller must also update ThisTimeLineID with the result of
+ * GetWalRcvWriteRecPtr and must check RecoveryInProgress().
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+{
+	const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff;
+
+	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
+	Assert(wantLength <= XLOG_BLCKSZ);
+	Assert(state->readLen == 0 || state->readLen <= XLOG_BLCKSZ);
+
+	/*
+	 * If the desired page is currently read in and valid, we have nothing to do.
+	 *
+	 * The caller should've ensured that it didn't previously advance readOff
+	 * past the valid limit of this timeline, so it doesn't matter if the current
+	 * TLI has since become historical.
+	 */
+	if (lastReadPage == wantPage &&
+		state->readLen != 0 &&
+		lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1))
+		return;
+
+	/*
+	 * If we're reading from the current timeline, it hasn't become historical
+	 * and the page we're reading is after the last page read, we can again
+	 * just carry on. (Seeking backwards requires a check to make sure the older
+	 * page isn't on a prior timeline).
+	 *
+	 * ThisTimeLineID might've become historical since we last looked, but the
+	 * caller is required not to read past the flush limit it saw at the time
+	 * it looked up the timeline. There's nothing we can do about it if
+	 * StartupXLOG() renames it to .partial concurrently.
+	 */
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	{
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
+		return;
+	}
+
+	/*
+	 * If we're just reading pages from a previously validated historical
+	 * timeline and the timeline we're reading from is valid until the
+	 * end of the current segment we can just keep reading.
+	 */
+	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0 &&
+		(wantPage + wantLength) / XLogSegSize < state->currTLIValidUntil / XLogSegSize)
+		return;
+
+	/*
+	 * If we reach this point we're either looking up a page for random access,
+	 * the current timeline just became historical, or we're reading from a new
+	 * segment containing a timeline switch. In all cases we need to determine
+	 * the newest timeline on the segment.
+	 *
+	 * If it's the current timeline we can just keep reading from here unless
+	 * we detect a timeline switch that makes the current timeline historical.
+	 * If it's a historical timeline we can read all the segment on the newest
+	 * timeline because it contains all the old timelines' data too. So only
+	 * one switch check is required.
+	 */
+	{
+		/*
+		 * We need to re-read the timeline history in case it's been changed
+		 * by a promotion or replay from a cascaded replica.
+		 */
+		List *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		XLogRecPtr endOfSegment = (((wantPage / XLogSegSize) + 1) * XLogSegSize) - 1;
+
+		Assert(wantPage / XLogSegSize == endOfSegment / XLogSegSize);
+
+		/* Find the timeline of the last LSN on the segment containing wantPage. */
+		state->currTLI = tliOfPointInHistory(endOfSegment, timelineHistory);
+		state->currTLIValidUntil = tliSwitchPoint(state->currTLI, timelineHistory,
+			&state->nextTLI);
+
+		Assert(state->currTLIValidUntil == InvalidXLogRecPtr ||
+				wantPage + wantLength < state->currTLIValidUntil);
+
+		list_free_deep(timelineHistory);
+
+		elog(DEBUG3, "switched to timeline %u valid until %X/%X",
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+	}
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -774,28 +905,84 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Determine the limit of xlog we can currently read to, and what the
+		 * most recent timeline is.
+		 *
+		 * RecoveryInProgress() will update ThisTimeLineID when it first
+		 * notices recovery finishes, so we only have to maintain it for the
+		 * local process until recovery ends.
 		 */
 		if (!RecoveryInProgress())
-		{
-			*pageTLI = ThisTimeLineID;
 			read_upto = GetFlushRecPtr();
+		else
+			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+
+		*pageTLI = ThisTimeLineID;
+
+		/*
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it each time through the loop because if we're in
+		 * recovery as a cascading standby, the current timeline might've
+		 * become historical. We can't rely on RecoveryInProgress() because
+		 * in a standby configuration like
+		 *
+		 *    A => B => C
+		 *
+		 * if we're a logical decoding session on C, and B gets promoted, our
+		 * timeline will change while we remain in recovery.
+		 *
+		 * We can't just keep reading from the old timeline as the last WAL
+		 * archive in the timeline will get renamed to .partial by StartupXLOG().
+		 *
+		 * If that happens after our caller updated ThisTimeLineID but before
+		 * we actually read the xlog page, we might still try to read from the
+		 * old (now renamed) segment and fail. There's not much we can do about
+		 * this, but it can only happen when we're a leaf of a cascading
+		 * standby whose master gets promoted while we're decoding, so a
+		 * one-off ERROR isn't too bad.
+		 */
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+
+		if (state->currTLI == ThisTimeLineID)
+		{
+
+			if (loc <= read_upto)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			read_upto = GetXLogReplayRecPtr(pageTLI);
+		{
+			/*
+			 * We're on a historical timeline, so limit reading to the switch
+			 * point where we moved to the next timeline.
+			 *
+			 * We don't need to GetFlushRecPtr or GetXLogReplayRecPtr. We know
+			 * about the new timeline, so we must've received past the end of
+			 * it.
+			 */
+			read_upto = state->currTLIValidUntil;
 
-		if (loc <= read_upto)
+			/*
+			 * Setting pageTLI to our wanted record's TLI is slightly wrong;
+			 * the page might begin on an older timeline if it contains a
+			 * timeline switch, since its xlog segment will have been copied
+			 * from the prior timeline. This is pretty harmless though, as
+			 * nothing cares so long as the timeline doesn't go backwards.  We
+			 * should read the page header instead; FIXME someday.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 41c5000..c251b92 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -235,11 +235,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
-	/* compute the current end-of-wal */
+	/*
+	 * Compute the current end-of-wal and maintain ThisTimeLineID.
+	 * RecoveryInProgress() will update ThisTimeLineID on promotion.
+	 */
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
+		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
 
 	ReplicationSlotAcquire(NameStr(*name));
 
@@ -280,6 +283,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
 			   (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0f6b828..90eb991 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -48,6 +48,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -721,6 +722,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -974,10 +981,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 663d3e7..a1beeb5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -161,6 +161,22 @@ struct XLogReaderState
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID	currTLI;
+	/*
+	 * Safe point to read to in currTLI if current TLI is historical
+	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
+	 *
+	 * Actually set to the start of the segment containing the timeline
+	 * switch that ends currTLI's validity, not the LSN of the switch
+	 * itself, since we can't assume the old segment will be present.
+	 */
+	XLogRecPtr	currTLIValidUntil;
+	/*
+	 * If currTLI is not the most recent known timeline, the next timeline to
+	 * read from when currTLIValidUntil is reached.
+	 */
+	TimeLineID	nextTLI;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 567a7f3..25a9942 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -52,4 +52,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetRecPtr, char *cur_page,
 					 TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state,
+					XLogRecPtr wantPage, uint32 wantLength);
+
 #endif
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..142a1b8 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-------------------------------------------------------------------------
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/009_logical_decoding_timelines.pl b/src/test/recovery/t/009_logical_decoding_timelines.pl
new file mode 100644
index 0000000..09830dc
--- /dev/null
+++ b/src/test/recovery/t/009_logical_decoding_timelines.pl
@@ -0,0 +1,130 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+use RecursiveCopy;
+use File::Copy;
+use IPC::Run ();
+use Scalar::Util qw(blessed);
+
+my ($stdout, $stderr, $ret);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+diag "Testing logical timeline following with a filesystem-level copy";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+# Verify that only the before base_backup slot is on the replica
+$stdout = $node_replica->safe_psql('postgres',
+	'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, 'before_basebackup',
+	'Expected to find only slot before_basebackup on replica');
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();");
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+# Shouldn't be able to read from slot created after base backup
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+);
+is($ret, 3, 'replaying from after_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "after_basebackup" does not exist/,
+	'after_basebackup slot missing');
+
+# Should be able to read from slot created before base backup
+($ret, $stdout, $stderr) = $node_replica->psql(
+	'postgres',
+"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
+	timeout => 30);
+is($ret, 0, 'replay from slot before_basebackup succeeds');
+
+my $final_expected_output_bb = q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT);
+is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
+is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+
+# So far we've peeked the slots, so when we fetch the same info over
+# pg_recvlogical we should get complete results. First, find out the commit lsn
+# of the last transaction. There's no max(pg_lsn), so:
+
+my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+
+# now use the walsender protocol to peek the slot changes and make sure we see
+# the same results.
+
+$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
+	$endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
+
+# walsender likes to add a newline
+chomp($stdout);
+is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
-- 
2.5.5

#49Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#48)
2 attachment(s)
Re: Logical decoding on standby

On 22 March 2017 at 10:51, Craig Ringer <craig@2ndquadrant.com> wrote:

Hi all

Updated timeline following patch attached.

There's a change in read_local_xlog_page to ensure we maintain
ThisTimeLineID properly, otherwise it's just comment changes.

OK, so we're looking OK with the TL following.
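
For readers following along, the core of the timeline selection can be
sketched outside the server. This is only an illustration under stated
assumptions: Lsn, Tli, TliEntry, SEG_SIZE, end_of_segment, tli_of_point and
tli_for_page are hypothetical stand-ins, not the real XLogRecPtr/TimeLineID
types or the readTimeLineHistory()/tliOfPointInHistory() functions, and a
fixed 16MB segment size is assumed. It shows the idea the patch relies on:
look up the timeline that owns the last LSN of the segment containing the
wanted page, because the newest timeline's copy of that segment also holds
the older timelines' data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL types; not the real definitions. */
typedef uint64_t Lsn;
typedef uint32_t Tli;

#define SEG_SIZE	((Lsn) 16 * 1024 * 1024)	/* assumed 16MB segments */
#define INVALID_LSN	((Lsn) 0)

/* One history entry: timeline 'tli' owns LSNs in [begin, end). */
typedef struct
{
	Tli		tli;
	Lsn		begin;
	Lsn		end;		/* INVALID_LSN means "still the current timeline" */
} TliEntry;

/* Last LSN of the segment that contains 'page'. */
static Lsn
end_of_segment(Lsn page)
{
	return ((page / SEG_SIZE) + 1) * SEG_SIZE - 1;
}

/* Timeline that owns 'ptr' according to the history, or 0 if none does. */
static Tli
tli_of_point(Lsn ptr, const TliEntry *hist, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (ptr >= hist[i].begin &&
			(hist[i].end == INVALID_LSN || ptr < hist[i].end))
			return hist[i].tli;
	}
	return 0;
}

/*
 * Timeline whose segment file should be opened to read 'page': the one
 * that owns the END of the segment, not the one that owns the page itself.
 */
static Tli
tli_for_page(Lsn page, const TliEntry *hist, size_t n)
{
	assert(page != INVALID_LSN);
	return tli_of_point(end_of_segment(page), hist, n);
}
```

With a history where timeline 1 ends at 40MB and timeline 2 continues from
there, a page at 33MB would be read from timeline 2's segment, since the
segment covering 32MB to 48MB contains the switch.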

I'm splitting up the rest of the decoding on standby patch set with
the goal of getting minimal functionality for creating and managing
slots on standbys in, so we can maintain slots on standbys and use
them when the standby is promoted to master.

The first piece is attached: it sends catalog_xmin separately from the
global xmin in hot_standby_feedback and stores it in the upstream physical
slot's catalog_xmin.

These are extracted directly from the logical decoding on standby
patch, with comments by Petr and Andres made re the relevant code
addressed.
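
To make the split concrete, here is a minimal standalone sketch of what the
standby side computes. It is not the patch code: Xid, HsFeedback,
xid_precedes, oldest_xmin, epoch_for_xid and build_feedback are hypothetical
names, and the per-backend xmin scan is reduced to a single input value. It
illustrates two things from the patches: the catalog_xmin only pulls the
global xmin back when the caller does not ask to ignore it, and each xid gets
its own epoch, decremented when the xid is numerically above nextXid (i.e. it
belongs to the previous epoch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)

/* Fields reported upstream in this sketch of a feedback message. */
typedef struct
{
	Xid			xmin;
	uint32_t	xmin_epoch;
	Xid			catalog_xmin;
	uint32_t	catalog_xmin_epoch;
} HsFeedback;

/* Wraparound-aware "a is older than b", valid for normal xids only. */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Global xmin, pulled back by the slots' catalog_xmin unless ignored. */
static Xid
oldest_xmin(Xid procs_xmin, Xid slot_catalog_xmin, bool ignore_catalog_xmin)
{
	if (!ignore_catalog_xmin &&
		slot_catalog_xmin != INVALID_XID &&
		xid_precedes(slot_catalog_xmin, procs_xmin))
		return slot_catalog_xmin;
	return procs_xmin;
}

/* Epoch of 'xid', given the next xid to be assigned and its epoch. */
static uint32_t
epoch_for_xid(Xid xid, Xid next_xid, uint32_t next_epoch)
{
	/* An xid above next_xid can only come from an earlier epoch. */
	assert(next_xid >= xid || next_epoch > 0);
	return (next_xid < xid) ? next_epoch - 1 : next_epoch;
}

/* Report xmin and catalog_xmin separately, each with its own epoch. */
static HsFeedback
build_feedback(Xid procs_xmin, Xid slot_catalog_xmin,
			   Xid next_xid, uint32_t next_epoch)
{
	HsFeedback	fb;

	fb.xmin = oldest_xmin(procs_xmin, slot_catalog_xmin, true);
	fb.catalog_xmin = slot_catalog_xmin;
	fb.xmin_epoch = epoch_for_xid(fb.xmin, next_xid, next_epoch);
	fb.catalog_xmin_epoch = epoch_for_xid(fb.catalog_xmin, next_xid, next_epoch);
	return fb;
}
```

The point of keeping the two values apart is that the upstream can hold back
only catalog cleanup for the slot's catalog_xmin, while ordinary user tables
are limited by the (usually newer) xmin alone.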

I will next be working on a bare-minimum facility for creating and
advancing logical slots on a replica without support for buffering
changes, creating historic snapshots or invoking the output plugin. The
slots will become usable after the replica is promoted. They'll track
their own restart_lsn, etc, and will keep track of xids so they can
manage their catalog_xmin, so there'll be no need for dodgy slot
syncing from the master, but they'll avoid most of the complex and
messy bits. The application will be expected to make sure a slot on
the master exists and is advanced before the corresponding slot on the
replica to protect required catalogs.

Then if there's agreement that it's the right way forward I can add
the catalog_xmin xlog'ing stuff as the next patch.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Allow-GetOldestXmin-to-ignore-replication-slot-xmin.patch (text/x-patch; charset=US-ASCII)
From b719c0b556a6823c9b48c0f4042aaf77a8d5f69e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 12:11:17 +0800
Subject: [PATCH 1/2] Allow GetOldestXmin to ignore replication slot xmin

For walsender to report replication slots' catalog_xmin separately, it's
necessary to be able to ask GetOldestXmin to ignore replication slots.
---
 contrib/pg_visibility/pg_visibility.c |  4 ++--
 contrib/pgstattuple/pgstatapprox.c    |  2 +-
 src/backend/access/transam/xlog.c     |  4 ++--
 src/backend/catalog/index.c           |  2 +-
 src/backend/commands/analyze.c        |  2 +-
 src/backend/commands/vacuum.c         |  4 ++--
 src/backend/replication/walreceiver.c |  2 +-
 src/backend/storage/ipc/procarray.c   | 36 +++++++++++++++++++++++++++--------
 src/include/storage/procarray.h       |  2 +-
 9 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d0f7618..6261e68 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -557,7 +557,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, true);
+		OldestXmin = GetOldestXmin(NULL, true, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -674,7 +674,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, true);
+				RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 8db1e20..743cbee 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, true);
+	OldestXmin = GetOldestXmin(rel, true, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9480377..c2b4f2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8895,7 +8895,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9258,7 +9258,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, false));
+		TruncateSUBTRANS(GetOldestXmin(NULL, false, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8d42a34..7ce7c8f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2270,7 +2270,7 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, true);
+		OldestXmin = GetOldestXmin(heapRelation, true, false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b91df98..0f166a0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1000,7 +1000,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, true);
+	OldestXmin = GetOldestXmin(onerel, true, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ff633fa..bdc7e16 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -527,7 +527,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true), rel);
+		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, true, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -939,7 +939,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, true);
+	newFrozenXid = GetOldestXmin(NULL, true, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 18d9d7e..b1ab8e0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1221,7 +1221,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+		xmin = GetOldestXmin(NULL, false, false);
 	else
 		xmin = InvalidTransactionId;
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0f8f435..63083c9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1292,17 +1292,22 @@ TransactionIdIsActive(TransactionId xid)
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
  * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
+ * possibility that we lose data that the standby would like to have
+ * unless the standby uses a replication slot to make its xmin persistent
+ * even when it isn't connected. The Hot Standby code deals with such cases by
+ * failing standby queries that needed to access already-removed data, so
+ * there's no integrity bug.
+ *
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * The caller may request that replication slots' catalog_xmin values be
+ * disregarded when calculating the global xmin. The caller must account
+ * for catalog_xmin separately.
  */
 TransactionId
-GetOldestXmin(Relation rel, bool ignoreVacuum)
+GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1376,7 +1381,9 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 		}
 	}
 
-	/* fetch into volatile var while ProcArrayLock is held */
+	/*
+	 * Fetch slot xmins into volatile var while ProcArrayLock is held.
+	 */
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
@@ -1430,11 +1437,24 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 	/*
 	 * After locks have been released and defer_cleanup_age has been applied,
 	 * check whether we need to back up further to make logical decoding
+	 * safe. We need to do so if we're computing the global limit (rel =
+	 * NULL) or if the passed relation is a catalog relation of some kind,
+	 * unless the caller asked us not to.
+	 */
+	if (!ignoreCatalogXmin &&
+		(rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)) &&
+		TransactionIdIsValid(replication_slot_catalog_xmin) &&
+		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
+		result = replication_slot_catalog_xmin;
+
+	/*
+	 * After locks have been released and defer_cleanup_age has been applied,
+	 * check whether we need to back up further to make logical decoding
 	 * possible. We need to do so if we're computing the global limit (rel =
 	 * NULL) or if the passed relation is a catalog relation of some kind.
 	 */
 	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
+		RelationIsAccessibleInLogicalDecoding(rel)) &&
 		TransactionIdIsValid(replication_slot_catalog_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
 		result = replication_slot_catalog_xmin;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9d5a13e..21d022f 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -53,7 +53,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum, bool ignoreCatalogXmin);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
-- 
2.5.5

0002-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch (text/x-patch; charset=US-ASCII)
From e303c2f8706c7a54460ab66fd2d1d0196361a99a Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 12:29:13 +0800
Subject: [PATCH 2/2] Report catalog_xmin separately to xmin in hot standby
 feedback

The catalog_xmin of slots on a standby was reported as part of the standby's
xmin, causing the master's xmin to be held down. This could cause considerable
unnecessary bloat on the master.

Instead, report catalog_xmin as a separate field in hot_standby_feedback. If
the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry behaviour doesn't change, as we store the
combined xmin and catalog_xmin in the PGPROC entry.

There's no backward compatibility concern here, as nothing except another
postgres instance of the same major version has any business sending hot
standby feedback and it's only used on the physical replication protocol.

---
 doc/src/sgml/protocol.sgml                         |  33 ++++++-
 src/backend/replication/walreceiver.c              |  43 ++++++--
 src/backend/replication/walsender.c                | 110 +++++++++++++++------
 .../recovery/t/010_logical_decoding_timelines.pl   |  38 ++++++-
 4 files changed, 175 insertions(+), 49 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 244e381..d8786f0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1911,10 +1911,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1924,7 +1925,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b1ab8e0..60c1aba 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1175,8 +1175,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1221,54 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * Obtain catalog_xmin to send separately, so the walsender can store
+		 * it on a physical slot's catalog_xmin if one is in use.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch --;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch --;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7561770..05b51a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -221,6 +221,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1605,7 +1606,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1626,6 +1627,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1636,59 +1645,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
- * Hot Standby feedback
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
  */
-static void
-ProcessStandbyHSFeedbackMessage(void)
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
 {
 	TransactionId nextXid;
 	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
+ * Hot Standby feedback
+ */
+static void
+ProcessStandbyHSFeedbackMessage(void)
+{
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
-
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1713,15 +1755,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 09830dc..4561a06 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -20,7 +20,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 10;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -31,10 +31,14 @@ my ($stdout, $stderr, $ret);
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
-$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
-$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', q[
+wal_level = 'logical'
+max_replication_slots = 3
+max_wal_senders = 2
+log_min_messages = 'debug2'
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+]);
 $node_master->dump_info;
 $node_master->start;
 
@@ -51,11 +55,17 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 my $backup_name = 'b1';
 $node_master->backup_fs_hot($backup_name);
 
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot');]);
+
 my $node_replica = get_new_node('replica');
 $node_replica->init_from_backup(
 	$node_master, $backup_name,
 	has_streaming => 1,
 	has_restoring => 1);
+$node_replica->append_conf(
+	'recovery.conf', q[primary_slot_name = 'phys_slot']);
+
 $node_replica->start;
 
 $node_master->safe_psql('postgres',
@@ -71,6 +81,24 @@ $stdout = $node_replica->safe_psql('postgres',
 is($stdout, 'before_basebackup',
 	'Expected to find only slot before_basebackup on replica');
 
+# Examine the physical slot the replica uses to stream changes
+# from the master to make sure its hot_standby_feedback
+# has locked in a catalog_xmin on the physical slot, and that
+# any xmin is >= the catalog_xmin
+$node_master->poll_query_until('postgres', q[
+	SELECT catalog_xmin IS NOT NULL
+	FROM pg_replication_slots
+	WHERE slot_name = 'phys_slot'
+	]);
+my $phys_slot = $node_master->slot('phys_slot');
+isnt($phys_slot->{'xmin'}, '',
+	'xmin assigned on physical slot of master');
+isnt($phys_slot->{'catalog_xmin'}, '',
+	'catalog_xmin assigned on physical slot of master');
+# Ignore wrap-around here, we're on a new cluster:
+cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
+	   'xmin on physical slot must not be lower than catalog_xmin');
+
 # Boom, crash
 $node_master->stop('immediate');
 
-- 
2.5.5
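
For anyone following the wire format rather than the C changes: the patch above extends the hot standby feedback message from two 4-byte fields after the timestamp to four. Below is a minimal standalone sketch of that layout, assuming network byte order as pq_sendint()/pq_sendint64() produce. The names put_u32, put_u64, build_hs_feedback and HS_FEEDBACK_LEN are hypothetical and exist only for this illustration; the real code appends to a StringInfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 type byte + 8-byte send time + four 4-byte fields (sketch constant) */
#define HS_FEEDBACK_LEN 25

/* Store a 32-bit value at buf[off..off+3] in network byte order. */
static size_t
put_u32(uint8_t *buf, size_t off, uint32_t v)
{
	buf[off] = (uint8_t) (v >> 24);
	buf[off + 1] = (uint8_t) (v >> 16);
	buf[off + 2] = (uint8_t) (v >> 8);
	buf[off + 3] = (uint8_t) v;
	return off + 4;
}

/* Store a 64-bit value big-endian, high 32 bits first. */
static size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	off = put_u32(buf, off, (uint32_t) (v >> 32));
	return put_u32(buf, off, (uint32_t) v);
}

/*
 * Hypothetical helper: fill buf (at least HS_FEEDBACK_LEN bytes) with a
 * feedback message in the field order used by the patch and return the
 * number of bytes written.
 */
size_t
build_hs_feedback(uint8_t *buf, int64_t send_time,
				  uint32_t xmin, uint32_t xmin_epoch,
				  uint32_t catalog_xmin, uint32_t catalog_xmin_epoch)
{
	size_t		off = 0;

	assert(buf != NULL);

	buf[off++] = 'h';			/* message type byte */
	off = put_u64(buf, off, (uint64_t) send_time);
	off = put_u32(buf, off, xmin);
	off = put_u32(buf, off, xmin_epoch);
	off = put_u32(buf, off, catalog_xmin);	/* new in this patch */
	off = put_u32(buf, off, catalog_xmin_epoch);	/* new in this patch */
	return off;
}
```

Sending xmin = 0 and catalog_xmin = 0 in this layout corresponds to the "feedback is being turned off" case described in the protocol.sgml hunk.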

#50Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#44)
Re: Logical decoding on standby

Hi,

On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:

0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends. I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

That would be my preference too.

The parts I think are important for Pg10 are:

* Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE? Besides,
being able to drop all slots that use a database upon DROP DATABASE is a
useful thing on its own.

But I have to admit, I've *severe* doubts about getting the whole
infrastructure for slot creation on replica into 10. The work is far
from ready, and we're mere days away from freeze.

* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.

* Ability to drop logical slots on replicas

That shouldn't actually require any changes, no?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Andres Freund (#50)
Re: Logical decoding on standby

On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

But I have to admit, I've *severe* doubts about getting the whole
infrastructure for slot creation on replica into 10. The work is far
from ready, and we're mere days away from freeze.

If Craig has to guess what would be acceptable, then it's not long enough.

It would be better if you could outline a specific approach so he can
code it. Coding takes about a day for most things, since Craig knows
the code and what we're trying to achieve.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#52Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#51)
Re: Logical decoding on standby

On 2017-03-22 14:58:29 +0000, Simon Riggs wrote:

On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

But I have to admit, I've *severe* doubts about getting the whole
infrastructure for slot creation on replica into 10. The work is far
from ready, and we're mere days away from freeze.

If Craig has to guess what would be acceptable, then it's not long enough.

I don't know what you're on about with that statement. I've spent a
good chunk of time looking at the 0003 patch, even though it's large and
contains a lot of different things. I suggested splitting things up. I
even suggested what to move earlier after Craig agreed with that
sentiment, in the mail you're replying to, because it seems
independently doable.

It would be better if you could outline a specific approach so he can
code it. Coding takes about a day for most things, since Craig knows
the code and what we're trying to achieve.

I find that fairly unconvincing. What we have here is a patch that isn't
close to being ready, contains a lot of complicated pieces, a couple
days before freeze. If we can pull useful pieces out: great. But it's
too late for major new development.

Greetings,

Andres Freund


#53Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Andres Freund (#50)
Re: Logical decoding on standby

On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

The parts I think are important for Pg10 are:

* Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE?

Not needed until the slot is in use, which is a later patch.

Besides,
being able to drop all slots that use a database upon DROP DATABASE is a
useful thing on its own.

Sure but that's a separate feature unrelated to this patch and we're
running out of time.

* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.

Knowing how much WAL to retain is key.

Why would decoding tell you how much WAL to retain?

We tried to implement this automatically from the master, which was
rejected. So the only other way is manually. We need one or the other.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#54Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#49)
1 attachment(s)
Re: Logical decoding on standby

On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:

I'm splitting up the rest of the decoding on standby patch set with
the goal of getting minimal functionality for creating and managing
slots on standbys in, so we can maintain slots on standbys and use
them when the standby is promoted to master.

The first, to send catalog_xmin separately to the global xmin on
hot_standby_feedback and store it in the upstream physical slot's
catalog_xmin, is attached.

These are extracted directly from the logical decoding on standby
patch, with comments by Petr and Andres made re the relevant code
addressed.
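
As a reading aid for the attached patch, here is a rough standalone sketch of how the upstream side handles the two reported values: with a physical slot, xmin and catalog_xmin are tracked separately; without a slot, only the older of the two can be held. This is an illustration under assumptions, not the patch itself. The slot_t struct and the function names are hypothetical, locking and the "mark dirty and recompute" step are omitted, and the comparison helpers only mirror the modulo-2^32 behaviour of TransactionIdIsNormal() and TransactionIdPrecedes().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;

#define INVALID_XID			((xid_t) 0)
#define FIRST_NORMAL_XID	((xid_t) 3)

/* Hypothetical stand-in for the xmin fields of a replication slot. */
typedef struct
{
	xid_t		xmin;
	xid_t		catalog_xmin;
} slot_t;

static bool
xid_is_normal(xid_t x)
{
	return x >= FIRST_NORMAL_XID;
}

/* Circular "a is older than b" for normal xids, plain compare otherwise. */
static bool
xid_precedes(xid_t a, xid_t b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/* Replace when either side is unset/invalid, or when feedback is newer. */
static bool
should_replace(xid_t current, xid_t feedback)
{
	return !xid_is_normal(current) ||
		!xid_is_normal(feedback) ||
		xid_precedes(current, feedback);
}

/* Slot in use: store both horizons separately. Returns true if updated. */
bool
slot_apply_feedback(slot_t *slot, xid_t feedback_xmin,
					xid_t feedback_catalog_xmin)
{
	bool		changed = false;

	assert(slot != NULL);

	if (should_replace(slot->xmin, feedback_xmin))
	{
		slot->xmin = feedback_xmin;
		changed = true;
	}
	if (should_replace(slot->catalog_xmin, feedback_catalog_xmin))
	{
		slot->catalog_xmin = feedback_catalog_xmin;
		changed = true;
	}
	return changed;
}

/* No slot: a single xmin is available, so hold back to the older value. */
xid_t
xmin_without_slot(xid_t feedback_xmin, xid_t feedback_catalog_xmin)
{
	if (xid_is_normal(feedback_catalog_xmin) &&
		xid_precedes(feedback_catalog_xmin, feedback_xmin))
		return feedback_catalog_xmin;
	return feedback_xmin;
}
```

With feedback (xmin 100, catalog_xmin 90) the slot path keeps 100 and 90 side by side, while the no-slot path falls back to 90 for everything, which is the extra bloat the separate field is meant to avoid.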

I've reduced your two patches back to one with a smaller blast radius.

I'll commit this tomorrow morning, barring objections.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

Report-catalog_xmin-separately-to-xmin-in-hot-standb.v2.patch (application/octet-stream)
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 244e381..d8786f0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1911,10 +1911,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1924,7 +1925,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 18d9d7e..0a15f4e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1175,8 +1175,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1221,54 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, false);
+	{
+		/*
+		 * Usually GetOldestXmin() would include the catalog_xmin in its
+		 * calculations, but we don't want to hold upstream back from vacuuming
+		 * normal user table tuples just because they're within the
+		 * catalog_xmin horizon of logical replication slots on this standby.
+		 * Instead we report the catalog_xmin to the upstream separately.
+		 */
+		xmin = GetOldestXminExtended(NULL,
+							 false, /* don't ignore vacuum */
+							 true /* ignore catalog xmin */);
+
+		/*
+		 * Obtain catalog_xmin to send separately, so the walsender can store
+		 * it on a physical slot's catalog_xmin if one is in use.
+		 */
+		ProcArrayGetReplicationSlotXmin(NULL, &catalog_xmin);
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch --;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch --;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7561770..05b51a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -221,6 +221,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1605,7 +1606,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1626,6 +1627,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1636,59 +1645,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
+ */
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
+{
+	TransactionId nextXid;
+	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
  * Hot Standby feedback
  */
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	TransactionId nextXid;
-	uint32		nextEpoch;
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
-
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1713,15 +1755,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0f8f435..93d6585 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1304,6 +1304,12 @@ TransactionIdIsActive(TransactionId xid)
 TransactionId
 GetOldestXmin(Relation rel, bool ignoreVacuum)
 {
+	return GetOldestXminExtended(rel, ignoreVacuum, false);
+}
+
+TransactionId
+GetOldestXminExtended(Relation rel, bool ignoreVacuum, bool ignoreSlots)
+{
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
 	int			index;
@@ -1433,8 +1439,9 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
 	 * possible. We need to do so if we're computing the global limit (rel =
 	 * NULL) or if the passed relation is a catalog relation of some kind.
 	 */
-	if ((rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
+	if (!ignoreSlots &&
+		(rel == NULL ||
+		RelationIsAccessibleInLogicalDecoding(rel)) &&
 		TransactionIdIsValid(replication_slot_catalog_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
 		result = replication_slot_catalog_xmin;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9d5a13e..4b50ada 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -54,6 +54,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestXmin(Relation rel, bool ignoreVacuum);
+extern TransactionId GetOldestXminExtended(Relation rel, bool ignoreVacuum, bool ignoreSlots);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 09830dc..4561a06 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -20,7 +20,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 10;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -31,10 +31,14 @@ my ($stdout, $stderr, $ret);
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
-$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
-$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', q[
+wal_level = 'logical'
+max_replication_slots = 3
+max_wal_senders = 2
+log_min_messages = 'debug2'
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+]);
 $node_master->dump_info;
 $node_master->start;
 
@@ -51,11 +55,17 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 my $backup_name = 'b1';
 $node_master->backup_fs_hot($backup_name);
 
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot');]);
+
 my $node_replica = get_new_node('replica');
 $node_replica->init_from_backup(
 	$node_master, $backup_name,
 	has_streaming => 1,
 	has_restoring => 1);
+$node_replica->append_conf(
+	'recovery.conf', q[primary_slot_name = 'phys_slot']);
+
 $node_replica->start;
 
 $node_master->safe_psql('postgres',
@@ -71,6 +81,24 @@ $stdout = $node_replica->safe_psql('postgres',
 is($stdout, 'before_basebackup',
 	'Expected to find only slot before_basebackup on replica');
 
+# Examine the physical slot the replica uses to stream changes
+# from the master to make sure its hot_standby_feedback
+# has locked in a catalog_xmin on the physical slot, and that
+# any xmin is >= the catalog_xmin
+$node_master->poll_query_until('postgres', q[
+	SELECT catalog_xmin IS NOT NULL
+	FROM pg_replication_slots
+	WHERE slot_name = 'phys_slot'
+	]);
+my $phys_slot = $node_master->slot('phys_slot');
+isnt($phys_slot->{'xmin'}, '',
+	'xmin assigned on physical slot of master');
+isnt($phys_slot->{'catalog_xmin'}, '',
+	'catalog_xmin assigned on physical slot of master');
+# Ignore wrap-around here, we're on a new cluster:
+cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
+	   'xmin on physical slot must not be lower than catalog_xmin');
+
 # Boom, crash
 $node_master->stop('immediate');
 
#55Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#53)
Re: Logical decoding on standby

On 2017-03-22 15:59:42 +0000, Simon Riggs wrote:

On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

The parts I think are important for Pg10 are:

* Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE?

Not needed until the slot is in use, which is a later patch.

Hm? We need to drop slots, if they can exist / be created, on a standby,
and they're on a dropped database. Otherwise we'll reserve resources,
while anyone connecting to the slot will likely just receive errors
because the database doesn't exist anymore. It's also one of the
patches that can quite easily be developed / reviewed, because there
really isn't anything complicated about it. Most of the code is already
in Craig's patch, it just needs some adjustments.

Besides,
allowing to drop all slots using a database upon DROP DATABASE, is a
useful thing on its own.

Sure but that's a separate feature unrelated to this patch and we're
running out of time.

Hm? The patch implemented it.

* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.

Knowing how much WAL to retain is key.

Why would decoding tell you how much WAL to retain?

Because decoding already has the necessary logic? (You need to retain
enough WAL to restart decoding for all in-progress transactions etc).

We tried to implement this automatically from the master, which was
rejected. So the only other way is manually. We need one or the other.

I think the current approach is roughly the right way - but that doesn't
make the patch ready.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#50)
Re: Logical decoding on standby

On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:

0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends. I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

That would be my preference too.

The parts I think are important for Pg10 are:

* Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE? Besides,
allowing to drop all slots using a database upon DROP DATABASE, is a
useful thing on its own.

Definitely beneficial, otherwise recovery will stop until you drop
slots, which isn't ideal.

* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.

Yes, and to update restart_lsn and catalog_xmin correctly.

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

* Ability to drop logical slots on replicas

That shouldn't actually require any changes, no?

It doesn't, it works as-is. I have NFI why I wrote that.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#57Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#56)
Re: Logical decoding on standby

On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:

On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:

0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends. I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

That would be my preference too.

The parts I think are important for Pg10 are:

* Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE? Besides,
allowing to drop all slots using a database upon DROP DATABASE, is a
useful thing on its own.

Definitely beneficial, otherwise recovery will stop until you drop
slots, which isn't ideal.

s/isn't ideal/not acceptable/ ;)

* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.

Yes, and to update restart_lsn and catalog_xmin correctly.

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

That seems like it'd end up being pretty hacky - the likelihood that
we'd run into snapbuild error cross-checks seems very high.

Greetings,

Andres Freund


#58Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#55)
Re: Logical decoding on standby

On 23 March 2017 at 00:17, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-22 15:59:42 +0000, Simon Riggs wrote:

On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

The parts I think are important for Pg10 are:

* Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE?

Not needed until the slot is in use, which is a later patch.

Hm? We need to drop slots, if they can exist / be created, on a standby,
and they're on a dropped database. Otherwise we'll reserve resources,
while anyone connecting to the slot will likely just receive errors
because the database doesn't exist anymore. It's also one of the
patches that can quite easily be developed / reviewed, because there
really isn't anything complicated about it. Most of the code is already
in Craig's patch, it just needs some adjustments.

Right, I'm not too concerned about doing that, and it's next on my
TODO as I clean up the split patch series.

* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.

Knowing how much WAL to retain is key.

Why would decoding tell you how much WAL to retain?

Because decoding already has the necessary logic? (You need to retain
enough WAL to restart decoding for all in-progress transactions etc).

Indeed; after all, standby status updates from the decoding client
only contain the *flushed* LSN. The downstream doesn't know the
restartpoint LSN, it must be tracked by the upstream.
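
To illustrate why the flushed LSN alone isn't enough, here is a toy
sketch in C. It is not PostgreSQL's actual algorithm (which derives
restart_lsn from running-xacts records and serialized snapshots); the
ToyLsn and ToyDecodedXact types and toy_restart_lsn() are hypothetical
names made up for this example. It only shows that the point decoding
must restart from depends on where still-unfinished transactions began,
which only the upstream can know.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an LSN; PostgreSQL's real type is XLogRecPtr. */
typedef uint64_t ToyLsn;

/* Hypothetical record of one transaction seen in WAL (not a PostgreSQL struct). */
typedef struct ToyDecodedXact
{
    ToyLsn first_lsn;   /* LSN of the transaction's first WAL record */
    ToyLsn commit_lsn;  /* LSN of its commit record, 0 if still in progress */
} ToyDecodedXact;

/*
 * Return the oldest LSN the upstream must retain so decoding can restart
 * after the client has confirmed receipt up to flushed_lsn.
 *
 * A transaction is finished from the client's point of view only if it
 * committed at or before flushed_lsn. Anything else must be decoded again
 * from its first record, so WAL from that record onward has to be kept.
 */
ToyLsn
toy_restart_lsn(const ToyDecodedXact *xacts, size_t n, ToyLsn flushed_lsn)
{
    ToyLsn restart = flushed_lsn;

    for (size_t i = 0; i < n; i++)
    {
        int still_needed = xacts[i].commit_lsn == 0 ||
                           xacts[i].commit_lsn > flushed_lsn;

        if (still_needed && xacts[i].first_lsn < restart)
            restart = xacts[i].first_lsn;
    }
    return restart;
}
```

With a transaction that began at LSN 150 and is still open, a client
reporting a flushed LSN of 400 would still need WAL retained from 150
in this model.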

It's also necessary to maintain our catalog_xmin correctly on the
standby so we can send it via hot_standby_feedback to a physical
replication slot used on the master, ensuring the master doesn't
remove catalog tuples we may still need.
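
For reference, a rough sketch of what the extended feedback message
could look like on the wire. The four Int32 fields follow the
protocol.sgml change in the patch; the leading 'h' byte and Int64
timestamp are how I understand the existing hot standby feedback
message from the protocol docs, so treat that part as an assumption.
The function names here (toy_build_hs_feedback, toy_feedback_disables)
are invented for the example and are not the walreceiver code.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 type byte + 8 byte timestamp + four 4 byte fields. */
#define TOY_HS_FEEDBACK_LEN 25

/* Write the low nbytes of value at buf+off in network (big-endian) order. */
static size_t
toy_put_be(uint8_t *buf, size_t off, uint64_t value, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        buf[off + i] = (uint8_t) (value >> (8 * (nbytes - 1 - i)));
    return off + nbytes;
}

/*
 * Serialize a feedback message carrying xmin and catalog_xmin separately.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t
toy_build_hs_feedback(uint8_t *buf, size_t buflen, int64_t send_time,
                      uint32_t xmin, uint32_t xmin_epoch,
                      uint32_t catalog_xmin, uint32_t catalog_xmin_epoch)
{
    size_t off = 0;

    if (buflen < TOY_HS_FEEDBACK_LEN)
        return 0;

    buf[off++] = 'h';                                    /* message type */
    off = toy_put_be(buf, off, (uint64_t) send_time, 8); /* sender's clock */
    off = toy_put_be(buf, off, xmin, 4);
    off = toy_put_be(buf, off, xmin_epoch, 4);
    off = toy_put_be(buf, off, catalog_xmin, 4);
    off = toy_put_be(buf, off, catalog_xmin_epoch, 4);
    return off;
}

/* Per the patch's docs: both xids zero means feedback is being turned off. */
int
toy_feedback_disables(uint32_t xmin, uint32_t catalog_xmin)
{
    return xmin == 0 && catalog_xmin == 0;
}
```

Keeping catalog_xmin in its own field is what lets the master store it
in the physical slot's catalog_xmin instead of folding it into xmin.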

I don't know what you're on about with that statement. I've spent a
good chunk of time looking at the 0003 patch, even though it's large
and contains a lot of different things. I suggested splitting things up.
I even suggested what to move earlier after Craig agreed with that
sentiment, in the mail you're replying to, because it seems
independently doable.

I really appreciate the review, as I'm all too aware of how time
consuming it can be.

From my PoV, the difficulty I'm in is that this patch series has
languished for most of the Pg 10 release cycle with no real input from
stakeholders in the logical decoding area, so while the review is
important, the fact that it's only arriving now means that it pretty
comprehensively blocks the patch for Pg 10. I asked on list for input on structure
(if/how to split it up) literally months ago, for example.

I've been struggling to get some kind of support for logical decoding
on standbys for most of two release cycles, and there are people
climbing the walls wanting it. I'm trying to make sure it's done
right, but I can't do that alone, and it's hard to progress when I
don't know what will be expected until it's too late to do anything
about it.

I guess all we can do at this point is get the foundations in place
and carry on for Pg 11, where the presence of in-core logical
replication will offer a lever to actually push this in. In the mean
time I'll have to continue carrying the out-of-tree failover slots
patch for people who use logical decoding and want it to be reliable.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#59Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#57)
Re: Logical decoding on standby

On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

That seems like it'd end up being pretty hacky - the likelihood that
we'd run into snapbuild error cross-checks seems very high.

TBH I'm not following this. But I haven't touched snapbuild much yet,
Petr's done much more with snapbuild than I have.

We're not going to have robust logical replication that's suitable for
HA and failover use on high load systems until 2020 or so, with Pg 12.
We'll need concurrent decoding and apply, which nobody's even started
on AFAIK, we'll need sequence replication, and more.

So I'd really, really like to get some kind of HA picture other than
"none" in for logical decoding based systems. If it's imperfect, it's
still something.

I wish we'd just proceeded with failover slots. They were blocked in
favour of decoding on standby, and HA is possible if we have decoding
on standby with some more work by the application. But now we'll have
neither. If we'd just done failover slots we could've had logical
replication able to follow failover in Pg 10.

What do _you_ see as the minimum acceptable way to achieve the ability
for a logical decoding client to follow failover of an upstream to a
physical standby? In the end, you're one of the main people whose view
carries weight in this area, and I don't want to develop yet another
approach only to have it knocked back once the work is done.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#60Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#59)
Re: Logical decoding on standby

On 2017-03-23 09:14:07 +0800, Craig Ringer wrote:

On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

That seems like it'd end up being pretty hacky - the likelihood that
we'd run into snapbuild error cross-checks seems very high.

TBH I'm not following this. But I haven't touched snapbuild much yet,
Petr's done much more with snapbuild than I have.

We can't just assume that snapbuild is going to work correctly when its
prerequisite - a pinned xmin horizon - isn't working.

We're not going to have robust logical replication that's suitable for
HA and failover use on high load systems until 2020 or so, with Pg 12.
We'll need concurrent decoding and apply, which nobody's even started
on AFAIK, we'll need sequence replication, and more.

These seem largely unrelated to the topic at hand (nor do I agree on all
of them).

So I'd really, really like to get some kind of HA picture other than
"none" in for logical decoding based systems. If it's imperfect, it's
still something.

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep. It's a nice later-on
feature. But that's an irrelevant aside.

I don't understand why you're making a "fundamental" argument here - I'm
not arguing against the goals of the patch at all. I want as much stuff
committed as we can in a good shape.

What do _you_ see as the minimum acceptable way to achieve the ability
for a logical decoding client to follow failover of an upstream to a
physical standby? In the end, you're one of the main people whose view
carries weight in this area, and I don't want to develop yet another

I think your approach here wasn't that bad? There's a lot of cleaning
up/shoring up needed, and we probably need a smarter feedback system. I
don't think anybody here has objected to the fundamental approach?

Greetings,

Andres Freund


#61Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#60)
Re: Logical decoding on standby

On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:

We can't just assume that snapbuild is going to work correctly when its
prerequisite - a pinned xmin horizon - isn't working.

Makes sense.

What do _you_ see as the minimum acceptable way to achieve the ability
for a logical decoding client to follow failover of an upstream to a
physical standby? In the end, you're one of the main people whose view
carries weight in this area, and I don't want to develop yet another

I think your approach here wasn't that bad? There's a lot of cleaning
up/shoring up needed, and we probably need a smarter feedback system. I
don't think anybody here has objected to the fundamental approach?

That's useful, thanks.

I'm not arguing that the patch as it stands is ready, and appreciate
the input re the general design.

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep. It's a nice later-on
feature. But that's an irrelevant aside.

I don't really agree that it's irrelevant.

Right now Pg has no HA capability for logical decoding clients. We've
now added logical replication, but it has no way to provide for
upstream node failure and ensure a consistent switch-over, whether to
a logical or physical replica. Since real world servers fail or need
maintenance, this is kind of a problem for practical production use.

Because of transaction serialization for commit-time order replay,
logical replication experiences saw-tooth replication lag, where large
or long xacts such as batch jobs effectively stall all later xacts
until they are fully replicated. We cannot currently start replicating
a big xact until it commits on the upstream, so that lag can easily be
~2x the runtime on the upstream.
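
A toy model of that saw-tooth, as a hedged sketch only: it assumes a
single apply worker that replays transactions strictly in commit order
and that applying a batch costs about as long as it ran upstream. The
ToyAppliedXact type, toy_replay_lag() and all the numbers are made up
for illustration.

```c
#include <stddef.h>

/* Hypothetical description of one committed transaction (times in ms). */
typedef struct ToyAppliedXact
{
    long commit_ms;  /* when it committed on the upstream */
    long apply_ms;   /* how long the downstream needs to apply it */
} ToyAppliedXact;

/*
 * Apply transactions one at a time in commit order and record, for each,
 * how long after its upstream commit it became visible downstream.
 * xacts must already be sorted by commit_ms; lag_ms must have room for n.
 */
void
toy_replay_lag(const ToyAppliedXact *xacts, size_t n, long *lag_ms)
{
    long apply_free_at = 0;  /* time the single apply worker is next idle */

    for (size_t i = 0; i < n; i++)
    {
        /* Replication of an xact cannot begin before it commits. */
        long start = xacts[i].commit_ms > apply_free_at
            ? xacts[i].commit_ms : apply_free_at;
        long finish = start + xacts[i].apply_ms;

        lag_ms[i] = finish - xacts[i].commit_ms;
        apply_free_at = finish;
    }
}
```

Feeding it a 60 second batch that commits at t=60s, followed by small
transactions at t=61s and t=62s, gives lags of 60s, 59.1s and 58.2s
under these assumptions, even though the small transactions each take
only 100ms to apply. The batch itself becomes visible at t=120s, i.e.
about 2x its runtime after it started.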

So while you can do sync rep on a logical standby, it tends to result
in big delays on COMMITs relative to physical rep, even if apps are
careful to keep transactions small. When the app DR planning people
come and ask you what the max data loss window / max sync rep lag is,
you have to say ".... dunno? depends on what else was running on the
server at the time."

AFAICT, changing those things will require the ability to begin
streaming reorder buffers for big xacts before commit, which as the
logical decoding on 2PC thread shows is not exactly trivial. We'll
also need to be able to apply them concurrently with other xacts on
the other end. Those are both big and complex things IMO, and I'll be
surprised if we can do either in Pg11 given that AFAIK nobody has even
started work on either of them or has a detailed plan.

Presuming we get some kind of failover to logical replica upstreams
into Pg11, it'll have significant limitations relative to what we can
deliver to users by using physical replication. Especially when it
comes to bounded-time lag for HA, sync rep, etc. And I haven't seen a
design for it, though Petr and I have discussed some with regards to
pglogical.

That's why I think we need to do HA on the physical side first.
Because it's going to take a long time to get equivalent functionality
for logical rep based upstreams, and when it arrives we'll still have to
teach management tools and other non-logical-rep logical decoding
clients about the new way of doing things. Whereas getting physical HA
setups to support logical downstreams requires only relatively minor
changes and gets us all the physical HA features _now_.

That's why we pursued failover slots - as a simple, minimal solution
to allowing logical decoding clients to inter-operate with Pg in a
physical HA configuration. TBH, I still think we should just add them.
Sure, they don't help us achieve decoding on standby, but they're a
lot simpler and they help Pg's behaviour with slots match user
expectations for how the rest of Pg behaves, i.e. if it's on the
master it'll be on the replica too. And as you've said, decoding on
standby is a nice-to-have, whereas I think some kind of HA support is
rather more important.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#62Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#54)
Re: Logical decoding on standby

On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:

I'm splitting up the rest of the decoding on standby patch set with
the goal of getting minimal functionality for creating and managing
slots on standbys in, so we can maintain slots on standbys and use
them when the standby is promoted to master.

The first, to send catalog_xmin separately to the global xmin on
hot_standby_feedback and store it in the upstream physical slot's
catalog_xmin, is attached.

These are extracted directly from the logical decoding on standby
patch, with comments by Petr and Andres made re the relevant code
addressed.

I've reduced your two patches back to one with a smaller blast radius.

I'll commit this tomorrow morning, barring objections.

Thanks. I was tempted to refactor GetOldestXmin to use flags myself,
but thought it might be at higher risk of objections. Since Eiji Seki
has shown that there are other uses for excluding particular things
from GetOldestXmin, and that's committed now, it's nice to have the
impact of this patch reduced.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#63Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#61)
Re: Logical decoding on standby

On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:

On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep. It's a nice later-on
feature. But that's an irrelevant aside.

I don't really agree that it's irrelevant.

I'm not sure we have enough time for either getting some parts of your
patch in, or for figuring out long term goals. But we definitely don't
have time for both.

- Andres


#64Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#63)
Re: Logical decoding on standby

On 23 March 2017 at 12:41, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:

On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep. It's a nice later-on
feature. But that's an irrelevant aside.

I don't really agree that it's irrelevant.

I'm not sure we have enough time for either getting some parts of your
patch in, or for figuring out long term goals. But we definitely don't
have time for both.

Fair.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#65Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#54)
1 attachment(s)
Re: Logical decoding on standby

On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:

I'm splitting up the rest of the decoding on standby patch set with
the goal of getting minimal functionality for creating and managing
slots on standbys in, so we can maintain slots on standbys and use
them when the standby is promoted to master.

The first, to send catalog_xmin separately to the global xmin on
hot_standby_feedback and store it in the upstream physical slot's
catalog_xmin, is attached.

These are extracted directly from the logical decoding on standby
patch, with comments by Petr and Andres made re the relevant code
addressed.

I've reduced your two patches back to one with a smaller blast radius.

I'll commit this tomorrow morning, barring objections.

This needs rebasing on top of

commit af4b1a0869bd3bb52e5f662e4491554b7f611489
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Wed Mar 22 16:51:01 2017 +0000

Refactor GetOldestXmin() to use flags

Replace ignoreVacuum parameter with more flexible flags.

Author: Eiji Seki
Review: Haribabu Kommi

That patch ended up using PROCARRAY flags directly as flags to
GetOldestXmin, so it doesn't make much sense to add a flag like
PROCARRAY_REPLICATION_SLOTS. There won't be any corresponding PROC_
flag for PGXACT->vacuumFlags, because replication slot xmin and
catalog_xmin are global state not tracked in individual proc entries.

Rather than add some kind of "PROC_RESERVED" flag in proc.h that would
never be used and only exist to reserve a bit for use for
PROCARRAY_REPLICATION_SLOTS, which we'd use as a flag to GetOldestXmin, I
added a new argument to GetOldestXmin like the prior patch did.

If preferred I can instead add

proc.h:

#define PROC_RESERVED 0x20

procarray.h:

#define PROCARRAY_REPLICATION_SLOTS 0x20

and then test for (flags & PROCARRAY_REPLICATION_SLOTS)

but that's kind of ugly to say the least, I'd rather just add another argument.
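
To make the two API shapes concrete, here's a self-contained toy in C.
None of this is the real procarray code: ToyState, the toy_ functions
and TOY_PROCARRAY_REPLICATION_SLOTS are hypothetical stand-ins, xid
wrap-around is ignored, and I'm assuming the flag would mean "ignore
slots", matching the boolean argument.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0u

/* Hypothetical snapshot of the state an oldest-xmin calculation looks at. */
typedef struct ToyState
{
    const ToyXid *backend_xmins;  /* per-backend xmin, 0 if none */
    size_t        nbackends;
    ToyXid        slot_xmin;          /* global slot xmin, 0 if none */
    ToyXid        slot_catalog_xmin;  /* global slot catalog_xmin, 0 if none */
} ToyState;

/* Minimum of two xids, treating 0 as "not set". Ignores wrap-around. */
static ToyXid
toy_min_valid(ToyXid a, ToyXid b)
{
    if (a == TOY_INVALID_XID)
        return b;
    if (b == TOY_INVALID_XID)
        return a;
    return a < b ? a : b;
}

/* Shape 1: an extra boolean argument, as in the attached patch. */
ToyXid
toy_oldest_xmin_arg(const ToyState *s, bool ignore_slots)
{
    ToyXid result = TOY_INVALID_XID;

    for (size_t i = 0; i < s->nbackends; i++)
        result = toy_min_valid(result, s->backend_xmins[i]);

    if (!ignore_slots)
    {
        /* Slot horizons are global state, not per-backend entries. */
        result = toy_min_valid(result, s->slot_xmin);
        result = toy_min_valid(result, s->slot_catalog_xmin);
    }
    return result;
}

/* Shape 2: a reserved flag bit with no per-backend counterpart. */
#define TOY_PROCARRAY_REPLICATION_SLOTS 0x20

ToyXid
toy_oldest_xmin_flags(const ToyState *s, int flags)
{
    return toy_oldest_xmin_arg(s,
                               (flags & TOY_PROCARRAY_REPLICATION_SLOTS) != 0);
}
```

Both shapes compute the same thing; the difference is only whether a
bit has to be reserved in a flag space that otherwise mirrors
per-backend flags.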

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch (text/x-patch; charset=US-ASCII)
From ffa43fae35857dbff0efe83ef199df165d887d97 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 12:29:13 +0800
Subject: [PATCH] Report catalog_xmin separately to xmin in hot standby
 feedback

The catalog_xmin of slots on a standby was reported as part of the standby's
xmin, causing the master's xmin to be held down. This could cause considerable
unnecessary bloat on the master.

Instead, report catalog_xmin as a separate field in hot_standby_feedback. If
the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry, behaviour doesn't change, as we store the
combined xmin and catalog_xmin in the PGPROC entry.

There's no backward compatibility concern here, as nothing except another
postgres instance of the same major version has any business sending hot
standby feedback and it's only used on the physical replication protocol.
---
 contrib/pg_visibility/pg_visibility.c              |   6 +-
 contrib/pgstattuple/pgstatapprox.c                 |   2 +-
 doc/src/sgml/protocol.sgml                         |  33 ++++++-
 src/backend/access/transam/xlog.c                  |   4 +-
 src/backend/catalog/index.c                        |   3 +-
 src/backend/commands/analyze.c                     |   2 +-
 src/backend/commands/vacuum.c                      |   5 +-
 src/backend/replication/walreceiver.c              |  44 +++++++--
 src/backend/replication/walsender.c                | 110 +++++++++++++++------
 src/backend/storage/ipc/procarray.c                |  11 ++-
 src/include/storage/procarray.h                    |   2 +-
 .../recovery/t/010_logical_decoding_timelines.pl   |  38 ++++++-
 12 files changed, 198 insertions(+), 62 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index ee3936e..2db5762 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -557,7 +557,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 	{
 		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM, false);
 	}
 
 	rel = relation_open(relid, AccessShareLock);
@@ -674,7 +674,9 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * a buffer lock. And this shouldn't happen often, so it's
 				 * worth being careful so as to avoid false positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestXmin(NULL,
+													 PROCARRAY_FLAGS_VACUUM,
+													 false);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 46c167a..5d0eda3 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	TransactionId OldestXmin;
 	uint64		misc_count = 0;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM, false);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 244e381..d8786f0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1911,10 +1911,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1924,7 +1925,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff4cf3a..ed5ee90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8895,7 +8895,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT, false));
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9258,7 +9258,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT, false));
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7924c30..69ce21c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2270,7 +2270,8 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	{
 		snapshot = SnapshotAny;
 		/* okay to ignore lazy VACUUMs here */
-		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM,
+								   false);
 	}
 
 	scan = heap_beginscan_strat(heapRelation,	/* relation */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 055338f..b005dc1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1000,7 +1000,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM, false);
 
 	/* Prepare for sampling block numbers */
 	BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d8ae3e1..d8722c7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -527,7 +527,8 @@ vacuum_set_xid_limits(Relation rel,
 	 * always an independent transaction.
 	 */
 	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+		TransactionIdLimitedForOldSnapshots(
+			GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM, false), rel);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -939,7 +940,7 @@ vac_update_datfrozenxid(void)
 	 * committed pg_class entries for new tables; see AddNewRelationTuple().
 	 * So we cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM, false);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 31c567b..06199ad 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1175,8 +1175,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1221,55 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	{
+		TransactionId slot_xmin;
+
+		/*
+		 * Usually GetOldestXmin() would include the global replication slot
+		 * xmin and catalog_xmin in its calculations, but we don't want to hold
+		 * upstream back from vacuuming normal user table tuples just because
+		 * they're within the catalog_xmin horizon of logical replication slots
+		 * on this standby, so we ignore slot xmin and catalog_xmin in
+		 * GetOldestXmin, then deal with them ourselves.
+		 */
+		xmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT, true);
+
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+
+		if (TransactionIdIsValid(slot_xmin) &&
+			TransactionIdPrecedes(slot_xmin, xmin))
+			xmin = slot_xmin;
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch --;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch --;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7561770..05b51a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -221,6 +221,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1605,7 +1606,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1626,6 +1627,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1636,59 +1645,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
- * Hot Standby feedback
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
  */
-static void
-ProcessStandbyHSFeedbackMessage(void)
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
 {
 	TransactionId nextXid;
 	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
+ * Hot Standby feedback
+ */
+static void
+ProcessStandbyHSFeedbackMessage(void)
+{
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
-
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1713,15 +1755,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 40c3247..1401efe 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1264,6 +1264,9 @@ TransactionIdIsActive(TransactionId xid)
  * corresponding flags is set. Typically, if you want to ignore ones with
  * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
  *
+ * The ignore_slots option may be set to true to ignore replication slot
+ * xmin and catalog_xmin when calculating the oldest xmin.
+ *
  * This is used by VACUUM to decide which deleted tuples must be preserved in
  * the passed in table. For shared relations backends in all databases must be
  * considered, but for non-shared relations that's not required, since only
@@ -1304,7 +1307,7 @@ TransactionIdIsActive(TransactionId xid)
  * GetOldestXmin() move backwards, with no consequences for data integrity.
  */
 TransactionId
-GetOldestXmin(Relation rel, int flags)
+GetOldestXmin(Relation rel, int flags, bool ignore_slots)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId result;
@@ -1418,7 +1421,8 @@ GetOldestXmin(Relation rel, int flags)
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
+	if (!ignore_slots &&
+		TransactionIdIsValid(replication_slot_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
@@ -1428,7 +1432,8 @@ GetOldestXmin(Relation rel, int flags)
 	 * possible. We need to do so if we're computing the global limit (rel =
 	 * NULL) or if the passed relation is a catalog relation of some kind.
 	 */
-	if ((rel == NULL ||
+	if (!ignore_slots &&
+		(rel == NULL ||
 		 RelationIsAccessibleInLogicalDecoding(rel)) &&
 		TransactionIdIsValid(replication_slot_catalog_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index c8e1ae5..0e8f53e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -75,7 +75,7 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, int flags);
+extern TransactionId GetOldestXmin(Relation rel, int flags, bool ignore_slots);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(void);
 
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 09830dc..4561a06 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -20,7 +20,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 10;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -31,10 +31,14 @@ my ($stdout, $stderr, $ret);
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
-$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
-$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', q[
+wal_level = 'logical'
+max_replication_slots = 3
+max_wal_senders = 2
+log_min_messages = 'debug2'
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+]);
 $node_master->dump_info;
 $node_master->start;
 
@@ -51,11 +55,17 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 my $backup_name = 'b1';
 $node_master->backup_fs_hot($backup_name);
 
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot');]);
+
 my $node_replica = get_new_node('replica');
 $node_replica->init_from_backup(
 	$node_master, $backup_name,
 	has_streaming => 1,
 	has_restoring => 1);
+$node_replica->append_conf(
+	'recovery.conf', q[primary_slot_name = 'phys_slot']);
+
 $node_replica->start;
 
 $node_master->safe_psql('postgres',
@@ -71,6 +81,24 @@ $stdout = $node_replica->safe_psql('postgres',
 is($stdout, 'before_basebackup',
 	'Expected to find only slot before_basebackup on replica');
 
+# Examine the physical slot the replica uses to stream changes
+# from the master to make sure its hot_standby_feedback
+# has locked in a catalog_xmin on the physical slot, and that
+# any xmin is >= the catalog_xmin
+$node_master->poll_query_until('postgres', q[
+	SELECT catalog_xmin IS NOT NULL
+	FROM pg_replication_slots
+	WHERE slot_name = 'phys_slot'
+	]);
+my $phys_slot = $node_master->slot('phys_slot');
+isnt($phys_slot->{'xmin'}, '',
+	'xmin assigned on physical slot of master');
+isnt($phys_slot->{'catalog_xmin'}, '',
+	'catalog_xmin assigned on physical slot of master');
+# Ignore wrap-around here, we're on a new cluster:
+cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
+	   'xmin on physical slot must not be lower than catalog_xmin');
+
 # Boom, crash
 $node_master->stop('immediate');
 
-- 
2.5.5
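
The wraparound check that the patch above factors out into TransactionIdInRecentPast() can be tried in isolation. What follows is a minimal standalone sketch, not the patch code: nextXid and nextEpoch are passed in as parameters instead of coming from GetNextXidAndEpoch(), and the names xid_precedes_or_equals and xid_in_recent_past are hypothetical, chosen only for this illustration. The circular comparison is assumed to behave like TransactionIdPrecedesOrEquals() does for normal xids.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Circular comparison modulo 2^32: a precedes-or-equals b if the signed
 * difference is not positive. (Hypothetical stand-in for
 * TransactionIdPrecedesOrEquals() on normal xids.)
 */
static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) <= 0;
}

/*
 * Return true if (xid, epoch) is not in the future and not so far in the
 * past that it has wrapped around, relative to (nextXid, nextEpoch).
 */
static bool
xid_in_recent_past(TransactionId xid, uint32_t epoch,
				   TransactionId nextXid, uint32_t nextEpoch)
{
	if (xid <= nextXid)
	{
		/* Same side of the epoch boundary: epochs must match. */
		if (epoch != nextEpoch)
			return false;
	}
	else
	{
		/* xid is numerically larger, so it must be from the prior epoch. */
		if (epoch + 1 != nextEpoch)
			return false;
	}

	/* Epoch is plausible; reject xids more than 2^31 behind nextXid. */
	return xid_precedes_or_equals(xid, nextXid);
}
```

The same function serves for both the xmin/epoch pair and the catalog_xmin/epoch pair in the feedback message, which is why the patch checks each pair independently.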

#66Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#65)
1 attachment(s)
Re: Logical decoding on standby

On 23 March 2017 at 16:07, Craig Ringer <craig@2ndquadrant.com> wrote:

If preferred I can instead add

proc.h:

#define PROC_RESERVED 0x20

procarray.h:

#define PROCARRAY_REPLICATION_SLOTS 0x20

and then test for (flags & PROCARRAY_REPLICATION_SLOTS)

Attached is a version done that way.
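
To make the flag-based approach concrete, here is a minimal standalone sketch, not the patch itself. The 0x10 and 0x20 values are the ones quoted in this thread; the 0x02 and 0x04 values for the vacuum and analyze flags are my assumption about the existing definitions, and proc_is_ignored and slots_are_ignored are hypothetical helper names used only for illustration. Note that the mask has to be parenthesised: without the parentheses, `flags & A | B | C` parses as `(flags & A) | B | C`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits that mirror PGXACT->vacuumFlags (0x02 and 0x04 are assumed values). */
#define PROCARRAY_VACUUM_FLAG			0x02
#define PROCARRAY_ANALYZE_FLAG			0x04
#define PROCARRAY_LOGICAL_DECODING_FLAG	0x10

/* Bit with no vacuumFlags equivalent: ignore slot xmin and catalog_xmin. */
#define PROCARRAY_SLOTS_XMIN			0x20

/* Parenthesised so that "flags & MASK" tests against all three bits. */
#define PROCARRAY_PROC_FLAGS_MASK \
	(PROCARRAY_VACUUM_FLAG | PROCARRAY_ANALYZE_FLAG | \
	 PROCARRAY_LOGICAL_DECODING_FLAG)

/*
 * Hypothetical helper: would a backend with these vacuumFlags be skipped?
 * Only the bits inside the mask are matched against vacuumFlags.
 */
static bool
proc_is_ignored(uint8_t vacuumFlags, int flags)
{
	return (vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK)) != 0;
}

/* Hypothetical helper: should replication slot xmins be left out? */
static bool
slots_are_ignored(int flags)
{
	return (flags & PROCARRAY_SLOTS_XMIN) != 0;
}
```

A caller that passes PROCARRAY_LOGICAL_DECODING_FLAG | PROCARRAY_SLOTS_XMIN then skips logical decoding backends and the slot xmins, and a backend that happened to have bit 0x20 set in its vacuumFlags would not be skipped by accident.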

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch (text/x-patch; charset=US-ASCII)
From 2e887ee19c9c1bae442b9f0682169f9b0c61268a Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 12:29:13 +0800
Subject: [PATCH] Report catalog_xmin separately to xmin in hot standby
 feedback

The catalog_xmin of slots on a standby was reported as part of the standby's
xmin, causing the master's xmin to be held down. This could cause considerable
unnecessary bloat on the master.

Instead, report catalog_xmin as a separate field in hot_standby_feedback. If
the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry, behaviour doesn't change: we store the
lower of xmin and catalog_xmin in the PGPROC entry.

There's no backward compatibility concern here, as nothing except another
postgres instance of the same major version has any business sending hot
standby feedback and it's only used on the physical replication protocol.
---
 doc/src/sgml/protocol.sgml                         |  33 ++++++-
 src/backend/replication/walreceiver.c              |  45 +++++++--
 src/backend/replication/walsender.c                | 110 +++++++++++++++------
 src/backend/storage/ipc/procarray.c                |  12 ++-
 src/include/storage/proc.h                         |   5 +
 src/include/storage/procarray.h                    |  11 +++
 .../recovery/t/010_logical_decoding_timelines.pl   |  38 ++++++-
 7 files changed, 202 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 244e381..d8786f0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1911,10 +1911,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1924,7 +1925,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 31c567b..0f22f17 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1175,8 +1175,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1221,56 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	{
+		TransactionId slot_xmin;
+
+		/*
+		 * Usually GetOldestXmin() would include the global replication slot
+		 * xmin and catalog_xmin in its calculations, but we don't want to hold
+		 * upstream back from vacuuming normal user table tuples just because
+		 * they're within the catalog_xmin horizon of logical replication slots
+		 * on this standby, so we ignore slot xmin and catalog_xmin in
+		 * GetOldestXmin, then deal with them ourselves.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
+
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+
+		if (TransactionIdIsValid(slot_xmin) &&
+			TransactionIdPrecedes(slot_xmin, xmin))
+			xmin = slot_xmin;
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch --;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch --;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7561770..05b51a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -221,6 +221,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1605,7 +1606,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1626,6 +1627,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1636,59 +1645,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
- * Hot Standby feedback
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
  */
-static void
-ProcessStandbyHSFeedbackMessage(void)
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
 {
 	TransactionId nextXid;
 	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
+ * Hot Standby feedback
+ */
+static void
+ProcessStandbyHSFeedbackMessage(void)
+{
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
-
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1713,15 +1755,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 40c3247..7c2e1e1 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1264,6 +1264,10 @@ TransactionIdIsActive(TransactionId xid)
  * corresponding flags is set. Typically, if you want to ignore ones with
  * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
  *
+ * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
+ * catalog_xmin of any replication slots that exist in the system when
+ * calculating the oldest xmin.
+ *
  * This is used by VACUUM to decide which deleted tuples must be preserved in
  * the passed in table. For shared relations backends in all databases must be
  * considered, but for non-shared relations that's not required, since only
@@ -1342,7 +1346,7 @@ GetOldestXmin(Relation rel, int flags)
 		volatile PGPROC *proc = &allProcs[pgprocno];
 		volatile PGXACT *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->vacuumFlags & flags)
+		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
 			continue;
 
 		if (allDbs ||
@@ -1418,7 +1422,8 @@ GetOldestXmin(Relation rel, int flags)
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
+	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
+		TransactionIdIsValid(replication_slot_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
@@ -1428,7 +1433,8 @@ GetOldestXmin(Relation rel, int flags)
 	 * possible. We need to do so if we're computing the global limit (rel =
 	 * NULL) or if the passed relation is a catalog relation of some kind.
 	 */
-	if ((rel == NULL ||
+	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
+		(rel == NULL ||
 		 RelationIsAccessibleInLogicalDecoding(rel)) &&
 		TransactionIdIsValid(replication_slot_catalog_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 945dd1d..1b345fa 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -44,6 +44,10 @@ struct XidCache
  *
  * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
  * in src/include/storage/procarray.h.
+ *
+ * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
+ * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
+ * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -51,6 +55,7 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
+#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index c8e1ae5..076f233 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -32,6 +32,17 @@
 #define		PROCARRAY_LOGICAL_DECODING_FLAG	0x10	/* currently doing logical
 													 * decoding outside xact */
 
+#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
+													 * catalog_xmin */
+/*
+ * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
+ * PGXACT->vacuumFlags. Other flags are used for different purposes and
+ * have no corresponding PROC flag equivalent.
+ */
+#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
+										 PROCARRAY_ANALYZE_FLAG | \
+										 PROCARRAY_LOGICAL_DECODING_FLAG)
+
 /* Use the following flags as an input "flags" to GetOldestXmin function */
 /* Consider all backends except for logical decoding ones which manage xmin separately */
 #define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 09830dc..4561a06 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -20,7 +20,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 10;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -31,10 +31,14 @@ my ($stdout, $stderr, $ret);
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
-$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
-$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', q[
+wal_level = 'logical'
+max_replication_slots = 3
+max_wal_senders = 2
+log_min_messages = 'debug2'
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+]);
 $node_master->dump_info;
 $node_master->start;
 
@@ -51,11 +55,17 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 my $backup_name = 'b1';
 $node_master->backup_fs_hot($backup_name);
 
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot');]);
+
 my $node_replica = get_new_node('replica');
 $node_replica->init_from_backup(
 	$node_master, $backup_name,
 	has_streaming => 1,
 	has_restoring => 1);
+$node_replica->append_conf(
+	'recovery.conf', q[primary_slot_name = 'phys_slot']);
+
 $node_replica->start;
 
 $node_master->safe_psql('postgres',
@@ -71,6 +81,24 @@ $stdout = $node_replica->safe_psql('postgres',
 is($stdout, 'before_basebackup',
 	'Expected to find only slot before_basebackup on replica');
 
+# Examine the physical slot the replica uses to stream changes
+# from the master to make sure its hot_standby_feedback
+# has locked in a catalog_xmin on the physical slot, and that
+# any xmin is >= the catalog_xmin
+$node_master->poll_query_until('postgres', q[
+	SELECT catalog_xmin IS NOT NULL
+	FROM pg_replication_slots
+	WHERE slot_name = 'phys_slot'
+	]);
+my $phys_slot = $node_master->slot('phys_slot');
+isnt($phys_slot->{'xmin'}, '',
+	'xmin assigned on physical slot of master');
+isnt($phys_slot->{'catalog_xmin'}, '',
+	'catalog_xmin assigned on physical slot of master');
+# Ignore wrap-around here, we're on a new cluster:
+cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
+	   'xmin on physical slot must not be lower than catalog_xmin');
+
 # Boom, crash
 $node_master->stop('immediate');
 
-- 
2.5.5

#67Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#66)
1 attachment(s)
Re: Logical decoding on standby

On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote:

Minor update to catalog_xmin walsender patch to fix failure to
parenthesize definition of PROCARRAY_PROC_FLAGS_MASK .
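
To illustrate why the missing parentheses matter: the patch uses the mask as
"flags & PROCARRAY_PROC_FLAGS_MASK", and since & binds tighter than | in C, an
unparenthesized definition expands to "(flags & VACUUM) | ANALYZE |
LOGICAL_DECODING". The following is only a minimal standalone sketch. The
VACUUM and ANALYZE flag values are assumed from the PostgreSQL headers, and
the MASK_* macros and skip_proc_* helpers are hypothetical names invented for
this demonstration.

```c
#include <assert.h>

/* Flag values as in procarray.h (VACUUM/ANALYZE values are assumed here). */
#define PROCARRAY_VACUUM_FLAG            0x02
#define PROCARRAY_ANALYZE_FLAG           0x04
#define PROCARRAY_LOGICAL_DECODING_FLAG  0x10
#define PROCARRAY_SLOTS_XMIN             0x20

/* Hypothetical name: the mask as first posted, without parentheses. */
#define MASK_UNPARENTHESIZED  PROCARRAY_VACUUM_FLAG | \
                              PROCARRAY_ANALYZE_FLAG | \
                              PROCARRAY_LOGICAL_DECODING_FLAG

/* Hypothetical name: the mask as fixed in this update. */
#define MASK_PARENTHESIZED   (PROCARRAY_VACUUM_FLAG | \
                              PROCARRAY_ANALYZE_FLAG | \
                              PROCARRAY_LOGICAL_DECODING_FLAG)

/*
 * Mirrors the GetOldestXmin() test "should this process be ignored?".
 * With the broken mask, "flags & A | B | C" parses as "(flags & A) | B | C",
 * so ANALYZE and LOGICAL_DECODING are always treated as requested.
 */
int
skip_proc_broken(int vacuumFlags, int flags)
{
    assert(vacuumFlags >= 0 && flags >= 0);
    return (vacuumFlags & (flags & MASK_UNPARENTHESIZED)) != 0;
}

/* Same test with the parenthesized mask: only requested flags match. */
int
skip_proc_fixed(int vacuumFlags, int flags)
{
    assert(vacuumFlags >= 0 && flags >= 0);
    return (vacuumFlags & (flags & MASK_PARENTHESIZED)) != 0;
}
```

In this sketch a process with only the ANALYZE bit set is wrongly skipped by
the broken variant even when the caller asked only to ignore logical decoding
backends. A compiler may warn about the unparenthesized expansion, which is
the point of the example.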

This one's ready to go. Working on drop slots on DB drop now.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch (text/x-patch; charset=US-ASCII)
From b5e34ecaa8f43825fe41ae2e2bbf0a97258cb56a Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 12:29:13 +0800
Subject: [PATCH] Report catalog_xmin separately to xmin in hot standby
 feedback

The catalog_xmin of slots on a standby was reported as part of the standby's
xmin, causing the master's xmin to be held down. This could cause considerable
unnecessary bloat on the master.

Instead, report catalog_xmin as a separate field in hot_standby_feedback. If
the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry, behaviour doesn't change, as we store the
combined xmin and catalog_xmin in the PGPROC entry.

There's no backward compatibility concern here, as nothing except another
postgres instance of the same major version has any business sending hot
standby feedback and it's only used on the physical replication protocol.
---
 doc/src/sgml/protocol.sgml                         |  33 ++++++-
 src/backend/replication/walreceiver.c              |  45 +++++++--
 src/backend/replication/walsender.c                | 110 +++++++++++++++------
 src/backend/storage/ipc/procarray.c                |  12 ++-
 src/include/storage/proc.h                         |   5 +
 src/include/storage/procarray.h                    |  11 +++
 .../recovery/t/010_logical_decoding_timelines.pl   |  38 ++++++-
 7 files changed, 202 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 48ca414..b3a5026 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1916,10 +1916,11 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current xmin. This may be 0, if the standby is
-          sending notification that Hot Standby feedback will no longer
-          be sent on this connection. Later non-zero messages may
-          reinitiate the feedback mechanism.
+          The standby's current global xmin, excluding the catalog_xmin from any
+          replication slots. If both this value and the following
+          catalog_xmin are 0 this is treated as a notification that Hot Standby
+          feedback will no longer be sent on this connection. Later non-zero
+          messages may reinitiate the feedback mechanism.
       </para>
       </listitem>
       </varlistentry>
@@ -1929,7 +1930,29 @@ The commands accepted in walsender mode are:
       </term>
       <listitem>
       <para>
-          The standby's current epoch.
+          The epoch of the global xmin xid on the standby.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The lowest catalog_xmin of any replication slots on the standby. Set to 0
+          if no catalog_xmin exists on the standby or if hot standby feedback is being
+          disabled.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Int32
+      </term>
+      <listitem>
+      <para>
+          The epoch of the catalog_xmin xid on the standby.
       </para>
       </listitem>
       </varlistentry>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 31c567b..0f22f17 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1175,8 +1175,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 {
 	TimestampTz now;
 	TransactionId nextXid;
-	uint32		nextEpoch;
-	TransactionId xmin;
+	uint32		xmin_epoch, catalog_xmin_epoch;
+	TransactionId xmin, catalog_xmin;
 	static TimestampTz sendTime = 0;
 	/* initially true so we always send at least one feedback message */
 	static bool master_has_standby_xmin = true;
@@ -1221,29 +1221,56 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 * everything else has been checked.
 	 */
 	if (hot_standby_feedback)
-		xmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	{
+		TransactionId slot_xmin;
+
+		/*
+		 * Usually GetOldestXmin() would include the global replication slot
+		 * xmin and catalog_xmin in its calculations, but we don't want to hold
+		 * upstream back from vacuuming normal user table tuples just because
+		 * they're within the catalog_xmin horizon of logical replication slots
+		 * on this standby, so we ignore slot xmin and catalog_xmin in
+		 * GetOldestXmin, then deal with them ourselves.
+		 */
+		xmin = GetOldestXmin(NULL,
+							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
+
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+
+		if (TransactionIdIsValid(slot_xmin) &&
+			TransactionIdPrecedes(slot_xmin, xmin))
+			xmin = slot_xmin;
+	}
 	else
+	{
 		xmin = InvalidTransactionId;
+		catalog_xmin = InvalidTransactionId;
+	}
 
 	/*
 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
 	 * the epoch boundary.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	GetNextXidAndEpoch(&nextXid, &xmin_epoch);
+	catalog_xmin_epoch = xmin_epoch;
 	if (nextXid < xmin)
-		nextEpoch--;
+		xmin_epoch --;
+	if (nextXid < catalog_xmin)
+		catalog_xmin_epoch --;
 
-	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 xmin, nextEpoch);
+	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u catalog_xmin %u catalog_xmin_epoch %u",
+		 xmin, xmin_epoch, catalog_xmin, catalog_xmin_epoch);
 
 	/* Construct the message and send it. */
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'h');
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendint(&reply_message, xmin, 4);
-	pq_sendint(&reply_message, nextEpoch, 4);
+	pq_sendint(&reply_message, xmin_epoch, 4);
+	pq_sendint(&reply_message, catalog_xmin, 4);
+	pq_sendint(&reply_message, catalog_xmin_epoch, 4);
 	walrcv_send(wrconn, reply_message.data, reply_message.len);
-	if (TransactionIdIsValid(xmin))
+	if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin))
 		master_has_standby_xmin = true;
 	else
 		master_has_standby_xmin = false;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a29d0e7..59ae22d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -242,6 +242,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tran
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
+static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -1756,7 +1757,7 @@ ProcessStandbyReplyMessage(void)
 
 /* compute new replication slot xmin horizon if needed */
 static void
-PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
+PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbackCatalogXmin)
 {
 	bool		changed = false;
 	ReplicationSlot *slot = MyReplicationSlot;
@@ -1777,6 +1778,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
+		!TransactionIdIsNormal(feedbackCatalogXmin) ||
+		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
+	{
+		changed = true;
+		slot->data.catalog_xmin = feedbackCatalogXmin;
+		slot->effective_catalog_xmin = feedbackCatalogXmin;
+	}
 	SpinLockRelease(&slot->mutex);
 
 	if (changed)
@@ -1787,59 +1796,92 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 }
 
 /*
- * Hot Standby feedback
+ * Check that the provided xmin/epoch are sane, that is, not in the future
+ * and not so far back as to be already wrapped around.
+ *
+ * Epoch of nextXid should be same as standby, or if the counter has
+ * wrapped, then one greater than standby.
+ *
+ * This check doesn't care about whether clog exists for these xids
+ * at all.
  */
-static void
-ProcessStandbyHSFeedbackMessage(void)
+static bool
+TransactionIdInRecentPast(TransactionId xid, uint32 epoch)
 {
 	TransactionId nextXid;
 	uint32		nextEpoch;
+
+	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+	if (xid <= nextXid)
+	{
+		if (epoch != nextEpoch)
+			return false;
+	}
+	else
+	{
+		if (epoch + 1 != nextEpoch)
+			return false;
+	}
+
+	if (!TransactionIdPrecedesOrEquals(xid, nextXid))
+		return false;				/* epoch OK, but it's wrapped around */
+
+	return true;
+}
+
+/*
+ * Hot Standby feedback
+ */
+static void
+ProcessStandbyHSFeedbackMessage(void)
+{
 	TransactionId feedbackXmin;
 	uint32		feedbackEpoch;
+	TransactionId feedbackCatalogXmin;
+	uint32		feedbackCatalogEpoch;
 
 	/*
 	 * Decipher the reply message. The caller already consumed the msgtype
-	 * byte.
+	 * byte. See XLogWalRcvSendHSFeedback() in walreceiver.c for the creation
+	 * of this message.
 	 */
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
 	feedbackXmin = pq_getmsgint(&reply_message, 4);
 	feedbackEpoch = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
+	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
+	elog(DEBUG2, "hot standby feedback xmin %u epoch %u, catalog_xmin %u epoch %u",
 		 feedbackXmin,
-		 feedbackEpoch);
+		 feedbackEpoch,
+		 feedbackCatalogXmin,
+		 feedbackCatalogEpoch);
 
-	/* Unset WalSender's xmin if the feedback message value is invalid */
-	if (!TransactionIdIsNormal(feedbackXmin))
+	/*
+	 * Unset WalSender's xmins if the feedback message values are invalid.
+	 * This happens when the downstream turned hot_standby_feedback off.
+	 */
+	if (!TransactionIdIsNormal(feedbackXmin)
+		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
-			PhysicalReplicationSlotNewXmin(feedbackXmin);
+			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
 	}
 
 	/*
 	 * Check that the provided xmin/epoch are sane, that is, not in the future
 	 * and not so far back as to be already wrapped around.  Ignore if not.
-	 *
-	 * Epoch of nextXid should be same as standby, or if the counter has
-	 * wrapped, then one greater than standby.
 	 */
-	GetNextXidAndEpoch(&nextXid, &nextEpoch);
+	if (TransactionIdIsNormal(feedbackXmin) &&
+		!TransactionIdInRecentPast(feedbackXmin, feedbackEpoch))
+		return;
 
-	if (feedbackXmin <= nextXid)
-	{
-		if (feedbackEpoch != nextEpoch)
-			return;
-	}
-	else
-	{
-		if (feedbackEpoch + 1 != nextEpoch)
-			return;
-	}
-
-	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
-		return;					/* epoch OK, but it's wrapped around */
+	if (TransactionIdIsNormal(feedbackCatalogXmin) &&
+		!TransactionIdInRecentPast(feedbackCatalogXmin, feedbackCatalogEpoch))
+		return;
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
@@ -1864,15 +1906,23 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry.
+	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * catalog xmin separately when using a slot, so we store the least
+	 * of the two provided when not using a slot.
 	 *
 	 * XXX: It might make sense to generalize the ephemeral slot concept and
 	 * always use the slot mechanism to handle the feedback xmin.
 	 */
 	if (MyReplicationSlot != NULL)		/* XXX: persistency configurable? */
-		PhysicalReplicationSlotNewXmin(feedbackXmin);
+		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
-		MyPgXact->xmin = feedbackXmin;
+	{
+		if (TransactionIdIsNormal(feedbackCatalogXmin)
+			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
+			MyPgXact->xmin = feedbackCatalogXmin;
+		else
+			MyPgXact->xmin = feedbackXmin;
+	}
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 40c3247..7c2e1e1 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1264,6 +1264,10 @@ TransactionIdIsActive(TransactionId xid)
  * corresponding flags is set. Typically, if you want to ignore ones with
  * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
  *
+ * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
+ * catalog_xmin of any replication slots that exist in the system when
+ * calculating the oldest xmin.
+ *
  * This is used by VACUUM to decide which deleted tuples must be preserved in
  * the passed in table. For shared relations backends in all databases must be
  * considered, but for non-shared relations that's not required, since only
@@ -1342,7 +1346,7 @@ GetOldestXmin(Relation rel, int flags)
 		volatile PGPROC *proc = &allProcs[pgprocno];
 		volatile PGXACT *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->vacuumFlags & flags)
+		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
 			continue;
 
 		if (allDbs ||
@@ -1418,7 +1422,8 @@ GetOldestXmin(Relation rel, int flags)
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
+	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
+		TransactionIdIsValid(replication_slot_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_xmin, result))
 		result = replication_slot_xmin;
 
@@ -1428,7 +1433,8 @@ GetOldestXmin(Relation rel, int flags)
 	 * possible. We need to do so if we're computing the global limit (rel =
 	 * NULL) or if the passed relation is a catalog relation of some kind.
 	 */
-	if ((rel == NULL ||
+	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
+		(rel == NULL ||
 		 RelationIsAccessibleInLogicalDecoding(rel)) &&
 		TransactionIdIsValid(replication_slot_catalog_xmin) &&
 		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 945dd1d..1b345fa 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -44,6 +44,10 @@ struct XidCache
  *
  * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
  * in src/include/storage/procarray.h.
+ *
+ * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
+ * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
+ * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -51,6 +55,7 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
+#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index c8e1ae5..9b42e49 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -32,6 +32,17 @@
 #define		PROCARRAY_LOGICAL_DECODING_FLAG	0x10	/* currently doing logical
 													 * decoding outside xact */
 
+#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
+													 * catalog_xmin */
+/*
+ * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
+ * PGXACT->vacuumFlags. Other flags are used for different purposes and
+ * have no corresponding PROC flag equivalent.
+ */
+#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
+										 PROCARRAY_ANALYZE_FLAG | \
+										 PROCARRAY_LOGICAL_DECODING_FLAG)
+
 /* Use the following flags as an input "flags" to GetOldestXmin function */
 /* Consider all backends except for logical decoding ones which manage xmin separately */
 #define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 09830dc..4561a06 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -20,7 +20,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 10;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -31,10 +31,14 @@ my ($stdout, $stderr, $ret);
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
-$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
-$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->append_conf('postgresql.conf', q[
+wal_level = 'logical'
+max_replication_slots = 3
+max_wal_senders = 2
+log_min_messages = 'debug2'
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+]);
 $node_master->dump_info;
 $node_master->start;
 
@@ -51,11 +55,17 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 my $backup_name = 'b1';
 $node_master->backup_fs_hot($backup_name);
 
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot');]);
+
 my $node_replica = get_new_node('replica');
 $node_replica->init_from_backup(
 	$node_master, $backup_name,
 	has_streaming => 1,
 	has_restoring => 1);
+$node_replica->append_conf(
+	'recovery.conf', q[primary_slot_name = 'phys_slot']);
+
 $node_replica->start;
 
 $node_master->safe_psql('postgres',
@@ -71,6 +81,24 @@ $stdout = $node_replica->safe_psql('postgres',
 is($stdout, 'before_basebackup',
 	'Expected to find only slot before_basebackup on replica');
 
+# Examine the physical slot the replica uses to stream changes
+# from the master to make sure its hot_standby_feedback
+# has locked in a catalog_xmin on the physical slot, and that
+# any xmin is >= the catalog_xmin
+$node_master->poll_query_until('postgres', q[
+	SELECT catalog_xmin IS NOT NULL
+	FROM pg_replication_slots
+	WHERE slot_name = 'phys_slot'
+	]);
+my $phys_slot = $node_master->slot('phys_slot');
+isnt($phys_slot->{'xmin'}, '',
+	'xmin assigned on physical slot of master');
+isnt($phys_slot->{'catalog_xmin'}, '',
+	'catalog_xmin assigned on physical slot of master');
+# Ignore wrap-around here, we're on a new cluster:
+cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
+	   'xmin on physical slot must not be lower than catalog_xmin');
+
 # Boom, crash
 $node_master->stop('immediate');
 
-- 
2.5.5
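
For readers following the walsender change, the epoch sanity check that the
patch factors out into TransactionIdInRecentPast() can be sketched in
isolation. This is only an illustration under stated assumptions: the names
xid_precedes_or_equals and xid_in_recent_past are hypothetical, the next
xid and epoch are passed in as parameters rather than fetched with
GetNextXidAndEpoch(), and the circular comparison is a simplified stand-in
for TransactionIdPrecedesOrEquals() that ignores special (non-normal) xids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Circular "a <= b" on 32-bit xids: a precedes b when the signed
 * difference is negative. Simplified; special xids are not handled.
 */
bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff <= 0;
}

/*
 * Returns true when (xid, epoch) is not in the future and not so far back
 * that it has already wrapped around, relative to (next_xid, next_epoch).
 */
bool
xid_in_recent_past(TransactionId xid, uint32_t epoch,
                   TransactionId next_xid, uint32_t next_epoch)
{
    assert(next_xid != 0);      /* 0 is InvalidTransactionId */

    if (xid <= next_xid)
    {
        /* Same side of the epoch boundary: epochs must match. */
        if (epoch != next_epoch)
            return false;
    }
    else
    {
        /* xid is numerically larger, so it must be from the prior epoch. */
        if (epoch + 1 != next_epoch)
            return false;
    }

    /* Epoch is plausible, but the xid may still have wrapped around. */
    if (!xid_precedes_or_equals(xid, next_xid))
        return false;

    return true;
}
```

With next xid 200 in epoch 1, an xmin of 4000000000 from epoch 0 is accepted
by this sketch, while an xmin of 1000000000 from epoch 0 is rejected because
it is more than 2^31 transactions in the past.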

#68Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#67)
Re: Logical decoding on standby

On 24 March 2017 at 06:23, Craig Ringer <craig@2ndquadrant.com> wrote:

On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote:

Minor update to catalog_xmin walsender patch to fix failure to
parenthesize definition of PROCARRAY_PROC_FLAGS_MASK .

This one's ready to go. Working on drop slots on DB drop now.

Committed. Next!

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#69Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#42)
Re: Logical decoding on standby

On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

Have you checked how high the overhead of XLogReadDetermineTimeline is?
A non-local function call, especially into a different translation-unit
(no partial inlining), for every single page might end up being
noticeable. That's fine in the cases it actually adds functionality,
but for a master streaming out data, that's not actually adding
anything.

I haven't been able to measure any difference. But, since we require
the caller to ensure a reasonably up to date ThisTimeLineID, maybe
it's worth adding an inlineable function for the fast-path that tests
the "page cached" and "timeline is current and unchanged" conditions?

//xlogutils.h:
static inline void XLogReadDetermineTimeline(...)
{
... first test for page already read-in and valid ...
... second test for ThisTimeLineID ...
XLogReadCheckTimeLineChange(...)
}

XLogReadCheckTimeLineChange(...)
{
... rest of function
}

(Yes, I know "inline" means little, but it's a hint for readers)

I'd rather avoid using a macro since it'd be pretty ugly, but it's
also an option if an inline func is undesirable.

#define XLOG_READ_DETERMINE_TIMELINE \
do { \
... same as above ...
} while (0)

Can be done after CF if needed anyway, it's just fiddling some code
around. Figured I'd mention though.

+             /*
+              * To avoid largely duplicating ReplicationSlotDropAcquired() or
+              * complicating it with already_locked flags for ProcArrayLock,
+              * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+              * just release our ReplicationSlotControlLock to drop the slot.
+              *
+              * There's no race here: we acquired this slot, and no slot "behind"
+              * our scan can be created or become active with our target dboid due
+              * to our exclusive lock on the DB.
+              */
+             LWLockRelease(ReplicationSlotControlLock);
+             ReplicationSlotDropAcquired();
+             LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

I don't see much problem with this, but I'd change the code so you
simply do a goto restart; if you released the slot. Then there's a lot
less chance / complications around temporarily releasing
ReplicationSlotControlLock.

I don't quite get this. I suspect I'm just not seeing the implications
as clearly as you do.

Do you mean we should restart the whole scan of the slot array if we
drop any slot? That'll be O(n * m) in the worst case (n array entries,
m dropped slots), but since we don't expect to be working on a big
array or a lot of slots it's unlikely to matter.

The patch coming soon will assume we'll restart the whole scan, anyway.
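
As a rough sketch of the "goto restart" approach, assuming that reading of
the suggestion is right: after each drop the scan starts again from the
beginning of the array instead of continuing from the current index. The
Slot struct and drop_db_slots() below are hypothetical simplifications with
no locking. In the real code the comments mark where
ReplicationSlotControlLock would be taken and released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical, much simplified replication slot. */
typedef struct Slot
{
    bool in_use;
    Oid  database;
} Slot;

/* Drop every in-use slot belonging to dboid; returns how many were dropped. */
int
drop_db_slots(Slot *slots, size_t nslots, Oid dboid)
{
    int    dropped = 0;
    size_t i;

    assert(slots != NULL || nslots == 0);

restart:
    /* Real code: acquire ReplicationSlotControlLock in shared mode here. */
    for (i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use || slots[i].database != dboid)
            continue;

        /* Real code: release the lock, then drop the acquired slot. */
        slots[i].in_use = false;
        dropped++;

        /* The array may have changed while unlocked, so rescan from 0. */
        goto restart;
    }
    /* Real code: release ReplicationSlotControlLock here. */

    return dropped;
}
```

The scan only terminates once a full pass finds no matching slot, which
avoids reasoning about what happened to the array while the lock was not
held.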

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#70Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#69)
1 attachment(s)
Re: Logical decoding on standby

Hi

Here's the next patch in the split-up series, drop db-specific
(logical) replication slots on DROP DATABASE.

Current behaviour is to ERROR if logical slots exist on the DB,
whether in-use or not.

With this patch we can DROP a database if it has logical slots so long
as they are not active. I haven't added any sort of syntax for this,
it's just done unconditionally.

I don't see any sensible way to stop a slot becoming active after we
check for active slots and begin the actual database DROP, since
ReplicationSlotAcquire will happily acquire a db-specific slot for a
different DB and the only lock it takes is a shared lock on
ReplicationSlotControlLock, which we presumably don't want to hold
throughout DROP DATABASE.

So this patch makes ReplicationSlotAcquire check that the slot
database matches the current database and refuse to acquire the slot
if it does not. The only sensible reason to acquire a slot from a
different DB is to drop it, and then it's only a convenience at best.
Slot drop is the only user-visible behaviour change, since all other
activity on logical slots happened when the backend was already
connected to the slot's DB. Appropriate docs changes have been made
and tests added.
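
The acquire-time check described above can be sketched as follows. This is
an illustration only: the Slot struct, the AcquireResult enum and
slot_acquire() are hypothetical, and the real ReplicationSlotAcquire()
reports failures with ereport() rather than a return code. The sketch relies
on physical slots carrying InvalidOid as their database, so they remain
acquirable from any database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, much simplified replication slot. */
typedef struct Slot
{
    Oid  database;  /* InvalidOid for physical slots */
    bool active;    /* already acquired by some backend? */
} Slot;

typedef enum AcquireResult
{
    ACQUIRE_OK,
    ACQUIRE_IN_USE,
    ACQUIRE_WRONG_DB
} AcquireResult;

/* Try to acquire a slot on behalf of a backend connected to my_database. */
AcquireResult
slot_acquire(Slot *slot, Oid my_database)
{
    assert(slot != NULL);

    /* Logical slots may only be acquired from the database they belong to. */
    if (slot->database != InvalidOid && slot->database != my_database)
        return ACQUIRE_WRONG_DB;

    if (slot->active)
        return ACQUIRE_IN_USE;

    slot->active = true;
    return ACQUIRE_OK;
}
```

Because a backend must be connected to the slot's database to acquire it,
and DROP DATABASE requires that no other backend is connected to the target
database, a slot in the sketch cannot become active once the drop has begun.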

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Make-DROP-DATABASE-drop-logical-slots-for-the-DB.patch (text/x-patch; charset=US-ASCII)
From c126a10e40aba0c39a43a97da591492d6240659c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:21:09 +0800
Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB

Automatically drop all logical replication slots associated with a
database when the database is dropped.

As a side-effect, pg_drop_replication_slot(...) may now only drop
logical slots when connected to the slot's database.
---
 contrib/test_decoding/sql/slot.sql                 |  8 ++
 doc/src/sgml/func.sgml                             |  3 +-
 doc/src/sgml/protocol.sgml                         |  2 +
 src/backend/commands/dbcommands.c                  | 32 +++++--
 src/backend/replication/logical/logical.c          | 12 +--
 src/backend/replication/slot.c                     | 97 +++++++++++++++++++++-
 src/include/replication/slot.h                     |  1 +
 src/test/recovery/t/006_logical_decoding.pl        | 34 +++++++-
 .../recovery/t/010_logical_decoding_timelines.pl   | 30 ++++++-
 9 files changed, 194 insertions(+), 25 deletions(-)

diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql
index 7ca83fe..22b22f3 100644
--- a/contrib/test_decoding/sql/slot.sql
+++ b/contrib/test_decoding/sql/slot.sql
@@ -48,3 +48,11 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_
 -- both should error as they should be dropped on error
 SELECT pg_drop_replication_slot('regression_slot1');
 SELECT pg_drop_replication_slot('regression_slot2');
+
+CREATE DATABASE testdb;
+\c testdb
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_otherdb', 'test_decoding');
+\c regression
+SELECT pg_drop_replication_slot('regression_slot_otherdb');
+\c testdb
+SELECT pg_drop_replication_slot('regression_slot_otherdb');
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ba6f8dd..78508d7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
        <entry>
         Drops the physical or logical replication slot
         named <parameter>slot_name</parameter>. Same as replication protocol
-        command <literal>DROP_REPLICATION_SLOT</>.
+        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
+        be called when connected to the same database the slot was created on.
        </entry>
       </row>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index b3a5026..5f97141 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
      <para>
       Drops a replication slot, freeing any reserved server-side resources. If
       the slot is currently in use by an active connection, this command fails.
+      If the slot is a logical slot that was created in a database other than
+      the database the walsender is connected to, this command fails.
      </para>
      <variablelist>
       <varlistentry>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5a63b1a..7fe2c2b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -845,19 +845,22 @@ dropdb(const char *dbname, bool missing_ok)
 				 errmsg("cannot drop the currently open database")));
 
 	/*
-	 * Check whether there are, possibly unconnected, logical slots that refer
-	 * to the to-be-dropped database. The database lock we are holding
-	 * prevents the creation of new slots using the database.
+	 * Check whether there are active logical slots that refer to the
+	 * to-be-dropped database. The database lock we are holding prevents the
+	 * creation of new slots using the database or existing slots becoming
+	 * active.
 	 */
-	if (ReplicationSlotsCountDBSlots(db_id, &nslots, &nslots_active))
+	(void) ReplicationSlotsCountDBSlots(db_id, &nslots, &nslots_active);
+	if (nslots_active)
+	{
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
-			  errmsg("database \"%s\" is used by a logical replication slot",
+			  errmsg("database \"%s\" is used by an active logical replication slot",
 					 dbname),
-				 errdetail_plural("There is %d slot, %d of them active.",
-								  "There are %d slots, %d of them active.",
-								  nslots,
-								  nslots, nslots_active)));
+				 errdetail_plural("There is %d active slot",
+								  "There are %d active slots",
+								  nslots_active, nslots_active)));
+	}
 
 	/*
 	 * Check for other backends in the target database.  (Because we hold the
@@ -899,6 +902,11 @@ dropdb(const char *dbname, bool missing_ok)
 	ReleaseSysCache(tup);
 
 	/*
+	 * Drop db-specific replication slots
+	 */
+	ReplicationSlotsDropDBSlots(db_id);
+
+	/*
 	 * Delete any comments or security labels associated with the database.
 	 */
 	DeleteSharedComments(db_id, DatabaseRelationId);
@@ -2124,11 +2132,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..86a8656 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -235,11 +235,7 @@ CreateInitDecodingContext(char *plugin,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 		errmsg("cannot use physical replication slot for logical decoding")));
 
-	if (slot->data.database != MyDatabaseId)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-		   errmsg("replication slot \"%s\" was not created in this database",
-				  NameStr(slot->data.name))));
+	Assert(slot->data.database == MyDatabaseId);
 
 	if (IsTransactionState() &&
 		GetTopTransactionIdIfAny() != InvalidTransactionId)
@@ -347,11 +343,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 (errmsg("cannot use physical replication slot for logical decoding"))));
 
-	if (slot->data.database != MyDatabaseId)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-		  (errmsg("replication slot \"%s\" was not created in this database",
-				  NameStr(slot->data.name)))));
+	Assert(slot->data.database == MyDatabaseId);
 
 	if (start_lsn == InvalidXLogRecPtr)
 	{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5237a9f..5a4eb79 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,9 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * If the slot is a logical slot it must be associated with the same
+ * database as the calling backend.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -328,6 +331,7 @@ ReplicationSlotAcquire(const char *name)
 	ReplicationSlot *slot = NULL;
 	int			i;
 	int			active_pid = 0; /* Keep compiler quiet */
+	Oid			database = InvalidOid;
 
 	Assert(MyReplicationSlot == NULL);
 
@@ -343,7 +347,9 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			database = s->data.database;
+			if (active_pid == 0 &&
+				(database == InvalidOid || database == MyDatabaseId))
 				active_pid = s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -357,6 +363,11 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+	if (database != InvalidOid && database != MyDatabaseId)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+		  (errmsg("replication slot \"%s\" was not created in this database",
+				  NameStr(slot->data.name)))));
 	if (active_pid != MyProcPid)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
@@ -796,6 +807,90 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller must hold an exclusive lock on the
+ * pg_database oid for the database to ensure no replication slots on the
+ * database are active.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+restart:
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotCtlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing. */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * We might fail here if the slot was active. Even though we hold an
+		 * exclusive lock on the database object a logical slot for that DB can
+		 * still be active if it's being dropped by a backend connected to
+		 * another DB.
+		 *
+		 * It's an unlikely race that'll only arise from concurrent user action,
+		 * so we'll just bail out.
+		 */
+		if (active_pid)
+			elog(ERROR, "replication slot %s is in use by pid %d",
+			 	 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * For safety we'll restart our scan from the beginning each
+		 * time we release the lock.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		goto restart;
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 62cacdb..9a2dbd7 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 66d5e4a..510dc90 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 5;
+use Test::More tests => 15;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -54,7 +54,7 @@ my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logi
 is($stdout_sql, $expected, 'got expected output from SQL decoding session');
 
 my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
-diag "waiting to replay $endpos";
+print "waiting to replay $endpos\n";
 
 my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
 chomp($stdout_recv);
@@ -64,5 +64,35 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+$node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
+
+is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
+	'replaying logical slot from another database fails');
+
+is($node_master->psql('otherdb', q[SELECT pg_drop_replication_slot('test_slot');]), 3,
+	'dropping logical slot from other DB fails');
+
+$node_master->safe_psql('otherdb', qq[SELECT pg_create_logical_replication_slot('otherdb_slot', 'test_decoding');]);
+
+is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 0,
+	'dropping a DB with inactive logical slots succeeds');
+
+is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
+	'logical slot was actually dropped with DB');
+
+# Restarting a node with wal_level = logical that has existing
+# slots must succeed, but decoding from those slots must fail.
+$node_master->safe_psql('postgres', 'ALTER SYSTEM SET wal_level = replica');
+is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'logical', 'wal_level is still logical before restart');
+$node_master->restart;
+is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'replica', 'wal_level is replica');
+isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
+	'restored slot catalog_xmin is nonzero');
+is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
+	'reading from slot with wal_level < logical fails');
+is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
+	'can drop logical slot while wal_level = replica');
+is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+
 # done with the node
 $node_master->stop;
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 4561a06..b618132 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -15,12 +15,15 @@
 # This module uses the first approach to show that timeline following
 # on a logical slot works.
 #
+# (For convenience, it also tests some recovery-related operations
+# on logical slots).
+#
 use strict;
 use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 10;
+use Test::More tests => 13;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -50,6 +53,16 @@ $node_master->safe_psql('postgres',
 $node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
 $node_master->safe_psql('postgres',
 	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+
+# We also want to verify that DROP DATABASE on a standby with a logical
+# slot works. This isn't strictly related to timeline following, but
+# the only way to get a logical slot on a standby right now is to use
+# the same physical copy trick, so:
+$node_master->safe_psql('postgres', 'CREATE DATABASE dropme;');
+$node_master->safe_psql('dropme',
+"SELECT pg_create_logical_replication_slot('dropme_slot', 'test_decoding');"
+);
+
 $node_master->safe_psql('postgres', 'CHECKPOINT;');
 
 my $backup_name = 'b1';
@@ -68,6 +81,17 @@ $node_replica->append_conf(
 
 $node_replica->start;
 
+# If we drop 'dropme' on the master, the standby should drop the
+# db and associated slot.
+is($node_master->psql('postgres', 'DROP DATABASE dropme'), 0,
+	'dropped DB with logical slot OK on master');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+is($node_replica->safe_psql('postgres', q[SELECT 1 FROM pg_database WHERE datname = 'dropme']), '',
+	'dropped DB dropme on standby');
+is($node_master->slot('dropme_slot')->{'slot_name'}, undef,
+	'logical slot was actually dropped on standby');
+
+# Back to testing failover...
 $node_master->safe_psql('postgres',
 "SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
 );
@@ -99,10 +123,13 @@ isnt($phys_slot->{'catalog_xmin'}, '',
 cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
 	   'xmin on physical slot must not be lower than catalog_xmin');
 
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+
 # Boom, crash
 $node_master->stop('immediate');
 
 $node_replica->promote;
+print "waiting for replica to come up\n";
 $node_replica->poll_query_until('postgres',
 	"SELECT NOT pg_is_in_recovery();");
 
@@ -154,5 +181,4 @@ $stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
 chomp($stdout);
 is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
 
-# We don't need the standby anymore
 $node_replica->teardown_node();
-- 
2.5.5

#71Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#70)
1 attachment(s)
Re: Logical decoding on standby

On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:

> So this patch makes ReplicationSlotAcquire check that the slot
> database matches the current database and refuse to acquire the slot
> if it does not.

New patch attached that drops the above requirement, so slots can still
be dropped from any DB.
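
To make the difference concrete, here is a rough sketch of the two
acquire rules. It is a toy model, not code from the patch: ToySlot,
toy_acquire() and the check_db flag are made-up names, and the real
ReplicationSlotAcquire() does its checks under a spinlock and reports
failures with ereport(). With check_db = true it behaves like the
earlier version (refuse a logical slot from another database); with
check_db = false it behaves like this version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified stand-in for a replication slot. */
typedef struct ToySlot
{
	Oid			database;	/* InvalidOid for physical slots */
	int			active_pid;	/* 0 when no backend holds the slot */
} ToySlot;

typedef enum
{
	ACQ_OK,
	ACQ_WRONG_DB,
	ACQ_IN_USE
} AcquireResult;

/*
 * Try to claim a slot for backend my_pid connected to database my_db.
 * check_db selects whether the "same database" rule is enforced.
 */
static AcquireResult
toy_acquire(ToySlot *slot, Oid my_db, int my_pid, bool check_db)
{
	assert(slot != NULL);

	/* Physical slots (InvalidOid) are never database-specific. */
	if (check_db &&
		slot->database != InvalidOid && slot->database != my_db)
		return ACQ_WRONG_DB;

	/* Somebody else already holds the slot. */
	if (slot->active_pid != 0)
		return ACQ_IN_USE;

	slot->active_pid = my_pid;
	return ACQ_OK;
}
```

Without the database check, a backend connected to any database can
claim the slot in order to drop it, which is what opens the race
described next.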

This introduces a narrow race window in which DROP DATABASE may ERROR:
somebody connects to a different database and runs
pg_drop_replication_slot(...) on one of the slots being dropped by
DROP DATABASE, after we check for active slots but before we've dropped
that slot. It's hard to hit and pretty harmless; the worst possible
result is that one or more of the slots are dropped before we ERROR
out of the DROP. But you clearly didn't want them anyway, since you
were dropping the DB and dropping some slots at the same time.
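
For illustration only, the partial-drop outcome can be modelled with a
small standalone sketch of the scan-and-restart loop. Everything here
(ToySlot, toy_drop_db_slots(), the blocked_by out-parameter) is
invented for the example; the real ReplicationSlotsDropDBSlots() takes
ReplicationSlotControlLock and the slot spinlock, and raises an ERROR
rather than returning.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical, simplified stand-in for a replication slot. */
typedef struct ToySlot
{
	int			in_use;		/* nonzero if the array entry holds a slot */
	Oid			database;	/* owning database of a logical slot */
	int			active_pid;	/* 0 when no backend holds the slot */
} ToySlot;

/*
 * Drop every slot belonging to dboid, restarting the scan after each
 * drop.  Returns the number of slots dropped.  If an active slot is
 * met, the scan stops early and *blocked_by is set to the holder's
 * pid (this models the ERROR path); otherwise *blocked_by is 0.
 */
static int
toy_drop_db_slots(ToySlot *slots, size_t nslots, Oid dboid, int *blocked_by)
{
	int			dropped = 0;
	size_t		i;

	assert(slots != NULL && blocked_by != NULL);
	*blocked_by = 0;

restart:
	for (i = 0; i < nslots; i++)
	{
		ToySlot    *s = &slots[i];

		if (!s->in_use || s->database != dboid)
			continue;

		/* Concurrently acquired by someone else: bail out. */
		if (s->active_pid != 0)
		{
			*blocked_by = s->active_pid;
			return dropped;
		}

		/* "Drop" the slot, then rescan from the beginning. */
		s->in_use = 0;
		dropped++;
		goto restart;
	}

	return dropped;
}
```

If the array holds three slots for the target database and the second
of them is held by another backend, the first is dropped, the scan
stops at the held one, and the third is left in place. That is the
"some but not all slots" result mentioned above.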

I think this one's ready to go.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

drop-database-drop-slots-v2.patch (text/x-patch; charset=US-ASCII)
From 99d5313d3a265bcc57ca6845230b9ec49d188710 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:21:09 +0800
Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB

Automatically drop all logical replication slots associated with a
database when the database is dropped.
---
 doc/src/sgml/func.sgml                             |  3 +-
 doc/src/sgml/protocol.sgml                         |  2 +
 src/backend/commands/dbcommands.c                  | 32 +++++---
 src/backend/replication/slot.c                     | 88 ++++++++++++++++++++++
 src/include/replication/slot.h                     |  1 +
 src/test/recovery/t/006_logical_decoding.pl        | 40 +++++++++-
 .../recovery/t/010_logical_decoding_timelines.pl   | 30 +++++++-
 7 files changed, 182 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ba6f8dd..78508d7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
        <entry>
         Drops the physical or logical replication slot
         named <parameter>slot_name</parameter>. Same as replication protocol
-        command <literal>DROP_REPLICATION_SLOT</>.
+        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
+        be called when connected to the same database the slot was created on.
        </entry>
       </row>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index b3a5026..5f97141 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
      <para>
       Drops a replication slot, freeing any reserved server-side resources. If
       the slot is currently in use by an active connection, this command fails.
+      If the slot is a logical slot that was created in a database other than
+      the database the walsender is connected to, this command fails.
      </para>
      <variablelist>
       <varlistentry>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5a63b1a..c0ba2b4 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -845,19 +845,22 @@ dropdb(const char *dbname, bool missing_ok)
 				 errmsg("cannot drop the currently open database")));
 
 	/*
-	 * Check whether there are, possibly unconnected, logical slots that refer
-	 * to the to-be-dropped database. The database lock we are holding
-	 * prevents the creation of new slots using the database.
+	 * Check whether there are active logical slots that refer to the
+	 * to-be-dropped database. The database lock we are holding prevents the
+	 * creation of new slots using the database or existing slots becoming
+	 * active.
 	 */
-	if (ReplicationSlotsCountDBSlots(db_id, &nslots, &nslots_active))
+	(void) ReplicationSlotsCountDBSlots(db_id, &nslots, &nslots_active);
+	if (nslots_active)
+	{
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
-			  errmsg("database \"%s\" is used by a logical replication slot",
+			  errmsg("database \"%s\" is used by an active logical replication slot",
 					 dbname),
-				 errdetail_plural("There is %d slot, %d of them active.",
-								  "There are %d slots, %d of them active.",
-								  nslots,
-								  nslots, nslots_active)));
+				 errdetail_plural("There is %d active slot",
+								  "There are %d active slots",
+								  nslots_active, nslots_active)));
+	}
 
 	/*
 	 * Check for other backends in the target database.  (Because we hold the
@@ -915,6 +918,11 @@ dropdb(const char *dbname, bool missing_ok)
 	dropDatabaseDependencies(db_id);
 
 	/*
+	 * Drop db-specific replication slots.
+	 */
+	ReplicationSlotsDropDBSlots(db_id);
+
+	/*
 	 * Drop pages for this database that are in the shared buffer cache. This
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
@@ -2124,11 +2132,17 @@ dbase_redo(XLogReaderState *record)
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
+			 *
+			 * This will lock out walsenders trying to connect to db-specific
+			 * slots for logical decoding too, so it's safe for us to drop slots.
 			 */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
 		}
 
+		/* Drop any database-specific replication slots */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5237a9f..d075eda 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -796,6 +796,94 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+/*
+ * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
+ * passed database oid. The caller should hold an exclusive lock on the
+ * pg_database oid for the database to prevent creation of new slots on the db
+ * or replay from existing slots.
+ *
+ * This routine isn't as efficient as it could be - but we don't drop databases
+ * often, especially databases with lots of slots.
+ *
+ * Another session that concurrently acquires an existing slot on the target DB
+ * (most likely to drop it) may cause this function to ERROR. If that happens
+ * it may have dropped some but not all slots.
+ */
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+restart:
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotCtlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing. */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * We might fail here if the slot was active. Even though we hold an
+		 * exclusive lock on the database object a logical slot for that DB can
+		 * still be active if it's being dropped by a backend connected to
+		 * another DB or is otherwise acquired.
+		 *
+		 * It's an unlikely race that'll only arise from concurrent user action,
+		 * so we'll just bail out.
+		 */
+		if (active_pid)
+			elog(ERROR, "replication slot %s is in use by pid %d",
+			 	 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * For safety we'll restart our scan from the beginning each
+		 * time we release the lock.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		goto restart;
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+}
+
 
 /*
  * Check whether the server's configuration supports using replication
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 62cacdb..9a2dbd7 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -177,6 +177,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 66d5e4a..bf9b50a 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 5;
+use Test::More tests => 16;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -54,7 +54,7 @@ my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logi
 is($stdout_sql, $expected, 'got expected output from SQL decoding session');
 
 my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
-diag "waiting to replay $endpos";
+print "waiting to replay $endpos\n";
 
 my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
 chomp($stdout_recv);
@@ -64,5 +64,41 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+$node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
+
+is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
+	'replaying logical slot from another database fails');
+
+$node_master->safe_psql('otherdb', qq[SELECT pg_create_logical_replication_slot('otherdb_slot', 'test_decoding');]);
+
+# make sure you can't drop a slot while active
+my $pg_recvlogical = IPC::Run::start(['pg_recvlogical', '-d', $node_master->connstr('otherdb'), '-S', 'otherdb_slot', '-f', '-', '--start']);
+$node_master->poll_query_until('otherdb', "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'otherdb_slot' AND active_pid IS NOT NULL)");
+is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 3,
+	'dropping a DB with active logical slots fails');
+$pg_recvlogical->kill_kill;
+is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
+	'logical slot still exists');
+
+$node_master->poll_query_until('otherdb', "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'otherdb_slot' AND active_pid IS NULL)");
+is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 0,
+	'dropping a DB with inactive logical slots succeeds');
+is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
+	'logical slot was actually dropped with DB');
+
+# Restarting a node with wal_level = logical that has existing
+# slots must succeed, but decoding from those slots must fail.
+$node_master->safe_psql('postgres', 'ALTER SYSTEM SET wal_level = replica');
+is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'logical', 'wal_level is still logical before restart');
+$node_master->restart;
+is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'replica', 'wal_level is replica');
+isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
+	'restored slot catalog_xmin is nonzero');
+is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
+	'reading from slot with wal_level < logical fails');
+is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
+	'can drop logical slot while wal_level = replica');
+is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+
 # done with the node
 $node_master->stop;
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 4561a06..b618132 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -15,12 +15,15 @@
 # This module uses the first approach to show that timeline following
 # on a logical slot works.
 #
+# (For convenience, it also tests some recovery-related operations
+# on logical slots).
+#
 use strict;
 use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 10;
+use Test::More tests => 13;
 use RecursiveCopy;
 use File::Copy;
 use IPC::Run ();
@@ -50,6 +53,16 @@ $node_master->safe_psql('postgres',
 $node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
 $node_master->safe_psql('postgres',
 	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+
+# We also want to verify that DROP DATABASE on a standby with a logical
+# slot works. This isn't strictly related to timeline following, but
+# the only way to get a logical slot on a standby right now is to use
+# the same physical copy trick, so:
+$node_master->safe_psql('postgres', 'CREATE DATABASE dropme;');
+$node_master->safe_psql('dropme',
+"SELECT pg_create_logical_replication_slot('dropme_slot', 'test_decoding');"
+);
+
 $node_master->safe_psql('postgres', 'CHECKPOINT;');
 
 my $backup_name = 'b1';
@@ -68,6 +81,17 @@ $node_replica->append_conf(
 
 $node_replica->start;
 
+# If we drop 'dropme' on the master, the standby should drop the
+# db and associated slot.
+is($node_master->psql('postgres', 'DROP DATABASE dropme'), 0,
+	'dropped DB with logical slot OK on master');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
+is($node_replica->safe_psql('postgres', q[SELECT 1 FROM pg_database WHERE datname = 'dropme']), '',
+	'dropped DB dropme on standby');
+is($node_master->slot('dropme_slot')->{'slot_name'}, undef,
+	'logical slot was actually dropped on standby');
+
+# Back to testing failover...
 $node_master->safe_psql('postgres',
 "SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
 );
@@ -99,10 +123,13 @@ isnt($phys_slot->{'catalog_xmin'}, '',
 cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
 	   'xmin on physical slot must not be lower than catalog_xmin');
 
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+
 # Boom, crash
 $node_master->stop('immediate');
 
 $node_replica->promote;
+print "waiting for replica to come up\n";
 $node_replica->poll_query_until('postgres',
 	"SELECT NOT pg_is_in_recovery();");
 
@@ -154,5 +181,4 @@ $stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
 chomp($stdout);
 is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
 
-# We don't need the standby anymore
 $node_replica->teardown_node();
-- 
2.5.5

#72Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#71)
Re: Logical decoding on standby

On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote:

> I think this one's ready to go.

Looks like something I could commit. Full review by me while offline
today, aiming to commit tomorrow barring issues raised.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#72)
Re: Logical decoding on standby

On 27 March 2017 at 16:20, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I think this one's ready to go.
>
> Looks like something I could commit. Full review by me while offline
> today, aiming to commit tomorrow barring issues raised.

Great.

Meanwhile I'm going to be trying to work with Stas on 2PC logical
decoding, while firming up the next patches in this series to see if
we can progress a bit further.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#74Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#71)
Re: Logical decoding on standby

Hi,

On 2017-03-27 16:03:48 +0800, Craig Ringer wrote:

On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:

So this patch makes ReplicationSlotAcquire check that the slot
database matches the current database and refuse to acquire the slot
if it does not.

New patch attached that drops above requirement, so slots can still be
dropped from any DB.

This introduces a narrow race window where DROP DATABASE may ERROR if
somebody connects to a different database and runs a
pg_drop_replication_slot(...) for one of the slots being dropped by
DROP DATABASE after we check for active slots but before we've dropped
the slot. But it's hard to hit and it's pretty harmless; the worst
possible result is dropping one or more of the slots before we ERROR
out of the DROP. But you clearly didn't want them anyway, since you
were dropping the DB and dropping some slots at the same time.

I think this one's ready to go.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 99d5313d3a265bcc57ca6845230b9ec49d188710 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:21:09 +0800
Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB

Automatically drop all logical replication slots associated with a
database when the database is dropped.
---
doc/src/sgml/func.sgml | 3 +-
doc/src/sgml/protocol.sgml | 2 +
src/backend/commands/dbcommands.c | 32 +++++---
src/backend/replication/slot.c | 88 ++++++++++++++++++++++
src/include/replication/slot.h | 1 +
src/test/recovery/t/006_logical_decoding.pl | 40 +++++++++-
.../recovery/t/010_logical_decoding_timelines.pl | 30 +++++++-
7 files changed, 182 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ba6f8dd..78508d7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
<entry>
Drops the physical or logical replication slot
named <parameter>slot_name</parameter>. Same as replication protocol
-        command <literal>DROP_REPLICATION_SLOT</>.
+        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
+        be called when connected to the same database the slot was created on.
</entry>
</row>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index b3a5026..5f97141 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
<para>
Drops a replication slot, freeing any reserved server-side resources. If
the slot is currently in use by an active connection, this command fails.
+      If the slot is a logical slot that was created in a database other than
+      the database the walsender is connected to, this command fails.
</para>
<variablelist>
<varlistentry>

Shouldn't the docs in the drop database section mention this?

+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	if (max_replication_slots <= 0)
+		return;
+
+restart:
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData slotname;
+		int active_pid;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotCtlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* only logical slots are database specific, skip */
+		if (!SlotIsLogical(s))
+			continue;
+
+		/* not our database, skip */
+		if (s->data.database != dboid)
+			continue;
+
+		/* Claim the slot, as if ReplicationSlotAcquire()ing. */
+		SpinLockAcquire(&s->mutex);
+		strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
+		NameStr(slotname)[NAMEDATALEN-1] = '\0';
+		active_pid = s->active_pid;
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+		}
+		SpinLockRelease(&s->mutex);
+
+		/*
+		 * We might fail here if the slot was active. Even though we hold an
+		 * exclusive lock on the database object a logical slot for that DB can
+		 * still be active if it's being dropped by a backend connected to
+		 * another DB or is otherwise acquired.
+		 *
+		 * It's an unlikely race that'll only arise from concurrent user action,
+		 * so we'll just bail out.
+		 */
+		if (active_pid)
+			elog(ERROR, "replication slot %s is in use by pid %d",
+			 	 NameStr(slotname), active_pid);
+
+		/*
+		 * To avoid largely duplicating ReplicationSlotDropAcquired() or
+		 * complicating it with already_locked flags for ProcArrayLock,
+		 * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
+		 * just release our ReplicationSlotControlLock to drop the slot.
+		 *
+		 * For safety we'll restart our scan from the beginning each
+		 * time we release the lock.
+		 */
+		LWLockRelease(ReplicationSlotControlLock);
+		ReplicationSlotDropAcquired();
+		goto restart;
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	/* recompute limits once after all slots are dropped */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();

I was concerned for a second that we'd skip doing
ReplicationSlotsComputeRequired* if we ERROR out above - but
ReplicationSlotDropAcquired already calls these as necessary. I.e. they
should be dropped from here.
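
For readers following along, the scan-and-restart shape of the quoted ReplicationSlotsDropDBSlots() can be sketched in isolation. This is only a simplified illustration, not the patch's code: DemoSlot, demo_slots, demo_lock(), demo_unlock(), demo_drop() and demo_drop_db_slots() are hypothetical stand-ins, the "lock" is a plain counter rather than an LWLock, and an error is reported with a -1 return instead of elog(ERROR). In line with the remark above, the sketch does no separate limit recomputation after the loop, on the assumption that the drop routine handles it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLOTS 4

/* Hypothetical stand-in for a replication slot; not PostgreSQL's struct. */
typedef struct DemoSlot
{
	bool		in_use;
	bool		is_logical;
	unsigned	database;
	int			active_pid;
} DemoSlot;

DemoSlot	demo_slots[MAX_SLOTS];

/* A counter standing in for ReplicationSlotControlLock. */
static int	demo_lock_depth = 0;

static void
demo_lock(void)
{
	assert(demo_lock_depth == 0);
	demo_lock_depth++;
}

static void
demo_unlock(void)
{
	assert(demo_lock_depth == 1);
	demo_lock_depth--;
}

/* Like the real drop routine, this must run without the scan lock held. */
static void
demo_drop(DemoSlot *s)
{
	assert(demo_lock_depth == 0);
	s->in_use = false;
	s->active_pid = 0;
}

/*
 * Drop every inactive logical slot belonging to dboid. Returns the number
 * of slots dropped, or -1 if a matching slot is held by another process.
 */
int
demo_drop_db_slots(unsigned dboid, int my_pid)
{
	int			dropped = 0;
	size_t		i;

restart:
	demo_lock();
	for (i = 0; i < MAX_SLOTS; i++)
	{
		DemoSlot   *s = &demo_slots[i];

		if (!s->in_use || !s->is_logical || s->database != dboid)
			continue;

		if (s->active_pid != 0)
		{
			/* Somebody else holds the slot: bail out, keeping earlier drops. */
			demo_unlock();
			return -1;
		}

		/* Claim the slot, release the scan lock, drop, then rescan. */
		s->active_pid = my_pid;
		demo_unlock();
		demo_drop(s);
		dropped++;
		goto restart;
	}
	demo_unlock();

	return dropped;
}
```

Restarting from index 0 after every drop costs a few extra iterations, but it means the loop never relies on array state observed before the lock was released.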

- Andres


#75Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#74)
Re: Logical decoding on standby

On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:

--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
<para>
Drops a replication slot, freeing any reserved server-side resources. If
the slot is currently in use by an active connection, this command fails.
+      If the slot is a logical slot that was created in a database other than
+      the database the walsender is connected to, this command fails.
</para>
<variablelist>
<varlistentry>

Shouldn't the docs in the drop database section mention this?

DROP DATABASE doesn't really discuss all the resources it drops, but
I'm happy to add mention of replication slots handling.

I just noticed that I failed to remove the docs changes regarding
dropping slots becoming db-specific, so I'll post a follow-up for that
in a sec.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#76Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#75)
1 attachment(s)
Re: Logical decoding on standby

On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote:

I just noticed that I failed to remove the docs changes regarding
dropping slots becoming db-specific, so I'll post a follow-up for that
in a sec.

Attached.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

fix-slot-drop-docs.patch (text/x-patch; charset=US-ASCII)
From 5fe01aef643905ec1f6dcffd0f5d583809fc9a21 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 29 Mar 2017 08:03:06 +0800
Subject: [PATCH] Documentation amendments for slot drop on db drop

The "Cleanup slots during drop database" patch incorrectly documented that
dropping logical slots must now be done from the database the slot was created
on. This was the case in an earlier variant of the patch, but not the committed
version.

Also document that idle logical replication slots will be dropped by
DROP DATABASE.
---
 doc/src/sgml/func.sgml              | 3 +--
 doc/src/sgml/protocol.sgml          | 2 --
 doc/src/sgml/ref/drop_database.sgml | 7 +++++++
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78508d7..ba6f8dd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18876,8 +18876,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
        <entry>
         Drops the physical or logical replication slot
         named <parameter>slot_name</parameter>. Same as replication protocol
-        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
-        be called when connected to the same database the slot was created on.
+        command <literal>DROP_REPLICATION_SLOT</>.
        </entry>
       </row>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5f97141..b3a5026 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,8 +2034,6 @@ The commands accepted in walsender mode are:
      <para>
       Drops a replication slot, freeing any reserved server-side resources. If
       the slot is currently in use by an active connection, this command fails.
-      If the slot is a logical slot that was created in a database other than
-      the database the walsender is connected to, this command fails.
      </para>
      <variablelist>
       <varlistentry>
diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml
index 740aa31..3427139 100644
--- a/doc/src/sgml/ref/drop_database.sgml
+++ b/doc/src/sgml/ref/drop_database.sgml
@@ -81,6 +81,13 @@ DROP DATABASE [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable>
    <xref linkend="app-dropdb"> instead,
    which is a wrapper around this command.
   </para>
+
+  <para>
+   Active <link linkend="logicaldecoding-replication-slots">logical
+   replication slots</> count as connections and will prevent a
+   database from being dropped. Inactive slots will be automatically
+   dropped when the database is dropped.
+  </para>
  </refsect1>
 
  <refsect1>
-- 
2.5.5

#77Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#76)
1 attachment(s)
Re: Logical decoding on standby

On 29 March 2017 at 08:11, Craig Ringer <craig@2ndquadrant.com> wrote:

On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote:

I just noticed that I failed to remove the docs changes regarding
dropping slots becoming db-specific, so I'll post a follow-up for that
in a sec.

Attached.

... and here's the next in the patch series. Both this and the
immediately prior minor patch fix-slot-drop-docs.patch are pending
now.

Notable changes in this patch since review:

* Split oldestCatalogXmin tracking into separate patch

* Critically, fix use of procArray->replication_slot_catalog_xmin in
GetSnapshotData's setting of RecentGlobalXmin and RecentGlobalDataXmin
so it instead uses ShmemVariableCache->oldestCatalogXmin . This
could've led to tuples newer than oldestCatalogXmin being removed.

* Memory barrier in UpdateOldestCatalogXmin and SetOldestCatalogXmin.
It still does a pre-check before deciding if it needs to take
ProcArrayLock, recheck, and advance, since we don't want to
unnecessarily contest ProcArrayLock.

* Remove unnecessary volatile usage (retained in
UpdateOldestCatalogXmin due to barrier)

* Remove unnecessary test for XLogInsertAllowed() in XactLogCatalogXminUpdate

* EnsureActiveLogicalSlotValid(void) - add (void)

* pgindented changes in this diff; have left unrelated changes alone
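
The "pre-check, then take the lock, recheck, and advance" flow from the barrier item above can be sketched outside PostgreSQL. This is a minimal illustration under stated assumptions, not the patch's implementation: it uses C11 atomics and an atomic_flag spinlock in place of pg_read_barrier(), pg_write_barrier() and ProcArrayLock, it compares xids with a plain `<` (ignoring wraparound and invalid xids), and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not PostgreSQL symbols. */
_Atomic uint32_t demo_oldest_catalog_xmin;	/* limit used for tuple removal */
_Atomic uint32_t demo_slots_catalog_xmin;	/* limit currently wanted by slots */
atomic_int	demo_wal_records;				/* counts simulated WAL records */
static atomic_flag demo_proc_array_lock = ATOMIC_FLAG_INIT;

/* Simplification: ignores xid wraparound and invalid xids. */
static bool
demo_needs_update(uint32_t vacuum_xmin, uint32_t slots_xmin)
{
	return vacuum_xmin < slots_xmin;
}

/* Returns true if this call advanced the removal limit. */
bool
demo_update_oldest_catalog_xmin(void)
{
	uint32_t	vacuum_xmin;
	uint32_t	slots_xmin;
	bool		advanced = false;

	/* Cheap unlocked pre-check; usually nothing has changed. */
	vacuum_xmin = atomic_load_explicit(&demo_oldest_catalog_xmin,
									   memory_order_acquire);
	slots_xmin = atomic_load_explicit(&demo_slots_catalog_xmin,
									  memory_order_relaxed);
	if (!demo_needs_update(vacuum_xmin, slots_xmin))
		return false;

	/* "Write WAL" for the value we read, before publishing it. */
	atomic_fetch_add(&demo_wal_records, 1);

	/* Spin until we own the lock. */
	while (atomic_flag_test_and_set_explicit(&demo_proc_array_lock,
											 memory_order_acquire))
		;

	/*
	 * Recheck under the lock: a concurrent updater may already have
	 * advanced the limit. slots_xmin is deliberately not re-read, since
	 * the logged value is the one that must be published.
	 */
	vacuum_xmin = atomic_load_explicit(&demo_oldest_catalog_xmin,
									   memory_order_relaxed);
	if (demo_needs_update(vacuum_xmin, slots_xmin))
	{
		assert(slots_xmin > vacuum_xmin);	/* the limit only moves forward */
		atomic_store_explicit(&demo_oldest_catalog_xmin, slots_xmin,
							  memory_order_release);
		advanced = true;
	}

	atomic_flag_clear_explicit(&demo_proc_array_lock, memory_order_release);
	return advanced;
}
```

The unlocked pre-check keeps the common no-change case off the lock entirely; the price is that a racing caller may occasionally log a record whose value another caller has already published.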

Re:

what does

+       TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+                                                                         * is guaranteed to still exist */

mean? I complained about the overall justification in the commit
already, but looking at this commit alone, the justification for this
part of the change is quite hard to understand.

The patch now contains

TransactionId oldestCatalogXmin; /* oldest xid it is guaranteed to be safe
                                  * to create a historic snapshot for; see
                                  * also procArray->replication_slot_catalog_xmin */

which I think is an improvement.

I've also sought to explain the purpose of this change better with

/*
* If necessary, copy the current catalog_xmin needed by replication slots to
* the effective catalog_xmin used for dead tuple removal and write a WAL
* record recording the change.
*
* This allows standbys to know the oldest xid for which it is safe to create
* a historic snapshot for logical decoding. VACUUM or other cleanup may have
* removed catalog tuple versions needed to correctly decode transactions older
* than this threshold. Standbys can use this information to cancel conflicting
* decoding sessions and invalidate slots that need discarded information.
*
* (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
* this, since they don't identify the relation as a catalog or not. Nor can a
* standby look up the relcache to get the Relation for the affected
* relfilenode to check if it is a catalog. The standby would also have no way
* to know the oldest safe position at startup if it wasn't in the control
* file.)
*/
void
UpdateOldestCatalogXmin(void)
{
...

Does that help?
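
As a rough aid to the explanation above, the two decisions involved (when the removal limit needs to be re-logged, and when a standby slot can no longer be decoded) can be written as small standalone predicates. This is a sketch only: the demo_* names are hypothetical, and the wraparound-aware comparison is a re-implementation written from memory of how TransactionIdPrecedes behaves, not a copy of the PostgreSQL source. Like the original, it relies on the usual two's-complement result when casting to int32_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and its special values. */
typedef uint32_t DemoXid;

#define DEMO_INVALID_XID		((DemoXid) 0)
#define DEMO_FIRST_NORMAL_XID	((DemoXid) 3)

static bool
demo_xid_is_valid(DemoXid x)
{
	return x != DEMO_INVALID_XID;
}

static bool
demo_xid_is_normal(DemoXid x)
{
	return x >= DEMO_FIRST_NORMAL_XID;
}

/* Wraparound-aware "a is older than b"; special xids compare as plain integers. */
bool
demo_xid_precedes(DemoXid a, DemoXid b)
{
	if (!demo_xid_is_normal(a) || !demo_xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * Primary side: the removal limit must be updated (and logged first) when
 * the slots' catalog_xmin has advanced past it, or when one of the two has
 * become set or unset.
 */
bool
demo_catalog_xmin_needs_update(DemoXid vacuum_xmin, DemoXid slots_xmin)
{
	return demo_xid_precedes(vacuum_xmin, slots_xmin)
		|| (demo_xid_is_valid(vacuum_xmin) != demo_xid_is_valid(slots_xmin));
}

/*
 * Standby side: a slot is usable only if a removal limit is known and the
 * slot's catalog_xmin is not older than that limit.
 */
bool
demo_slot_is_usable(DemoXid oldest_catalog_xmin, DemoXid slot_catalog_xmin)
{
	if (!demo_xid_is_valid(oldest_catalog_xmin))
		return false;
	return !demo_xid_precedes(slot_catalog_xmin, oldest_catalog_xmin);
}
```

The second predicate mirrors the idea that a standby cannot trust a slot at all when no limit has been recorded, which is why the limit also has to survive restarts.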

(Sidenote for later: ResolveRecoveryConflictWithLogicalDecoding will
need a read barrier too, when the next patch adds it.)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

log-catalog-xmin-advances-v2.patch (text/x-patch; charset=US-ASCII)
From 4b8e3aaa52539ef8cf3c79d1ed0319cc44800a32 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH] Log catalog_xmin advances before removing catalog tuples

Write a WAL record before advancing the oldest catalog_xmin preserved by
VACUUM and other tuple removal.

Previously GetOldestXmin would use procArray->replication_slot_catalog_xmin as
the xid limit for vacuuming catalog tuples, so it was not possible for standbys
to determine whether all catalog tuples needed for a catalog snapshot for a
given xid would still exist.

Logging catalog_xmin advances allows standbys to determine if a logical slot on
the standby has become unsafe to use. It can then refuse to start logical
decoding on that slot or, if decoding is in progress, raise a conflict with
recovery.

Note that we only emit new WAL records if catalog_xmin changes, which happens
due to changes in slot state. So this won't generate WAL whenever oldestXmin
advances.
---
 src/backend/access/heap/rewriteheap.c       |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c      |   9 +++
 src/backend/access/transam/varsup.c         |  14 ++++
 src/backend/access/transam/xact.c           |  35 ++++++++
 src/backend/access/transam/xlog.c           |  12 ++-
 src/backend/commands/vacuum.c               |   9 +++
 src/backend/postmaster/bgwriter.c           |  10 +++
 src/backend/replication/logical/decode.c    |  11 +++
 src/backend/replication/logical/logical.c   |  38 +++++++++
 src/backend/replication/walreceiver.c       |   2 +-
 src/backend/replication/walsender.c         |  13 +++
 src/backend/storage/ipc/procarray.c         | 119 +++++++++++++++++++++++++---
 src/bin/pg_controldata/pg_controldata.c     |   2 +
 src/include/access/transam.h                |   6 ++
 src/include/access/xact.h                   |  12 ++-
 src/include/catalog/pg_control.h            |   1 +
 src/include/storage/procarray.h             |   5 +-
 src/test/recovery/t/006_logical_decoding.pl |  12 ++-
 18 files changed, 294 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..36bbb98 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use the catalog_xmin being retained by vacuum */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..6cf939f 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,6 +414,20 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	pg_write_barrier();
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..0e3b870 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5652,6 +5652,41 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 *
+		 * Existing sessions are not notified and must check the safe xmin.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	xl_xact_catalog_xmin_advance xlrec;
+	xlrec.new_catalog_xmin = new_catalog_xmin;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+	return XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..a3ac2c1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5008,6 +5008,7 @@ BootStrapXLOG(void)
 	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
+	checkPoint.oldestCatalogXmin = InvalidTransactionId;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
 	checkPoint.oldestCommitTsXid = InvalidTransactionId;
@@ -5021,6 +5022,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6613,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6633,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8702,6 +8708,7 @@ CreateCheckPoint(int flags)
 	checkPoint.nextXid = ShmemVariableCache->nextXid;
 	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
 	LWLockRelease(XidGenLock);
 
 	LWLockAcquire(CommitTsLock, LW_SHARED);
@@ -9631,6 +9638,7 @@ xlog_redo(XLogReaderState *record)
 		 * redo an xl_clog_truncate if it changed since initialization.
 		 */
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9729,8 +9737,8 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 9fbb0eb..ae41dc3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin();
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..df239e0 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,15 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not
+		 * a standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly, even if we haven't had a
+		 * recent vacuum run.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin();
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..07a120d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..28d04d1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -68,6 +68,8 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -126,6 +128,8 @@ StartupDecodingContext(List *output_plugin_options,
 	/* shorter lines... */
 	slot = MyReplicationSlot;
 
+	EnsureActiveLogicalSlotValid();
+
 	context = AllocSetContextCreate(CurrentMemoryContext,
 									"Logical decoding context",
 									ALLOCSET_DEFAULT_SIZES);
@@ -963,3 +967,37 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * Currently a logical slot can only become unusable if we're doing logical
+	 * decoding on standby and the master advanced its catalog_xmin past the
+	 * threshold we need, removing tuples that we'll require to start decoding
+	 * at our restart_lsn.
+	 */
+	if (RecoveryInProgress())
+	{
+		/*
+		 * Check if enough catalog is retained for this slot. No locking is
+		 * needed here since oldestCatalogXmin can only advance, so if it's
+		 * past what we need that's not going to change. We have marked our
+		 * slot as active so redo won't replay past our catalog_xmin without
+		 * first terminating our session.
+		 */
+		TransactionId shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+		if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+			TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("replication slot '%s' requires catalogs removed by master",
+							NameStr(MyReplicationSlot->data.name))));
+	}
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 771ac30..c2ad791 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1233,7 +1233,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..cdc5f95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1778,6 +1783,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..381c230 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,11 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -1306,6 +1310,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1444,6 +1451,79 @@ GetOldestXmin(Relation rel, int flags)
 }
 
 /*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal and write a WAL
+ * record recording the change.
+ *
+ * This allows standbys to know the oldest xid for which it is safe to create
+ * a historic snapshot for logical decoding. VACUUM or other cleanup may have
+ * removed catalog tuple versions needed to correctly decode transactions older
+ * than this threshold. Standbys can use this information to cancel conflicting
+ * decoding sessions and invalidate slots that need discarded information.
+ *
+ * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
+ * this, since they don't identify the relation as a catalog or not.  Nor can a
+ * standby look up the relcache to get the Relation for the affected
+ * relfilenode to check if it is a catalog. The standby would also have no way
+ * to know the oldest safe position at startup if it wasn't in the control
+ * file.)
+ */
+void
+UpdateOldestCatalogXmin(void)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	Assert(XLogInsertAllowed());
+
+	/*
+	 * Do an unlocked check to see if there's a new catalog_xmin in procarray,
+	 * so we can avoid taking a lock and writing xlog if they're unchanged,
+	 * as is most likely.
+	 *
+	 * The read barrier is for oldestCatalogXmin, we don't care whether we see
+	 * the very latest replication_slot_catalog_xmin or not.
+	 */
+	pg_read_barrier();
+	vacuum_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+	slots_catalog_xmin = procArray->replication_slot_catalog_xmin;
+
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+	{
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+		/*
+		 * A concurrent updater could've changed the oldestCatalogXmin so we
+		 * need to re-check under ProcArrayLock before updating. The LWLock
+		 * provides a barrier.
+		 *
+		 * We must not re-read replication_slot_catalog_xmin even if it has
+		 * advanced, since we xlog'd the older value. A later check will
+		 * advance it again.
+		 */
+		vacuum_catalog_xmin = *((volatile TransactionId *) &ShmemVariableCache->oldestCatalogXmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			SetOldestCatalogXmin(slots_catalog_xmin);
+		LWLockRelease(ProcArrayLock);
+	}
+}
+
+/*
  * GetMaxSnapshotXidCount -- get max size for snapshot XID array
  *
  * We have to export this for use by snapmgr.c.
@@ -1493,7 +1573,8 @@ GetMaxSnapshotSubxidCount(void)
  *			older than this are known not running any more.
  *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
  *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by GetOldestXmin(true, true).
+ *			the same computation done by
+ *			GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT|PROCARRAY_FLAGS_VACUUM)
  *		RecentGlobalDataXmin: the global xmin for non-catalog tables
  *			>= RecentGlobalXmin
  *
@@ -1700,7 +1781,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1792,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2168,14 +2252,14 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by an existing
+	 * replication slot it's definitely safe to start there, and it can't
+	 * advance while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2965,18 +3049,29 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..fe5e67c 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,11 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+	TransactionId oldestCatalogXmin; /* oldest xid it is guaranteed to be safe
+									  * to create a historic snapshot for; see
+									  * also
+									  * procArray->replication_slot_catalog_xmin
+									  * */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -180,6 +185,7 @@ extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..6d18d18 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -137,7 +137,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -187,6 +187,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+}	xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -391,6 +398,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   int xactflags, TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..69a82d7 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(void);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..f38b38a 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 25;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -17,6 +17,10 @@ $node_master->append_conf(
 wal_level = logical
 ));
 $node_master->start;
+
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after start");
+
 my $backup_name = 'master_backup';
 
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -96,9 +100,15 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[1-9]\d*$/m,
+	"pg_controldata's oldestCatalogXmin is nonzero");
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint");
 
 # done with the node
 $node_master->stop;
-- 
2.5.5

#78Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#77)
Re: Logical decoding on standby

On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:

> * Split oldestCatalogXmin tracking into separate patch

Regarding this, Simon raised concerns about xlog volume here.

It's pretty negligible.

We only write a new record when a vacuum runs after catalog_xmin
advances on the slot with the currently-lowest catalog_xmin (or, if
vacuum doesn't run reasonably soon, when the bgworker next looks).

So at worst on a fairly slow moving system or one with a super high
vacuum rate we'll write one per commit. But in most cases we'll write
a lot fewer than that. When running t/006_logical_decoding.pl for
example:

$ ../../../src/bin/pg_waldump/pg_waldump \
    tmp_check/data_master_daPa/pgdata/pg_wal/000000010000000000000001 | grep CATALOG
rmgr: Transaction len (rec/tot): 4/ 30, tx: 0, lsn: 0/01648D50, prev 0/01648D18, desc: CATALOG_XMIN catalog_xmin 555
rmgr: Transaction len (rec/tot): 4/ 30, tx: 0, lsn: 0/0164C840, prev 0/0164C378, desc: CATALOG_XMIN catalog_xmin 0
pg_waldump: FATAL: error in WAL record at 0/16BBF10: invalid record length at 0/16BBF88: wanted 24, got 0

and of course, none at all unless you use logical decoding.
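
The "only write a record when something changed" rule can be sketched as follows. This is a standalone illustration, not the patch itself: `xid_precedes`, `catalog_xmin_needs_update` and `count_catalog_xmin_records` are hypothetical names, and the modular comparison is assumed to behave like PostgreSQL's TransactionIdPrecedes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;		/* stand-in for PostgreSQL's type */

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Wraparound-aware "a is older than b"; special xids compare plainly. */
bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * A new record is needed when the slots' catalog_xmin has advanced past
 * the last logged value, or when it has become newly set or unset.
 */
bool
catalog_xmin_needs_update(TransactionId logged, TransactionId needed)
{
	bool		logged_valid = (logged != InvalidTransactionId);
	bool		needed_valid = (needed != InvalidTransactionId);

	return xid_precedes(logged, needed) || (logged_valid != needed_valid);
}

/*
 * Replay a series of observed slot catalog_xmin values (one per check,
 * e.g. per vacuum) and count how many records would be written.
 */
size_t
count_catalog_xmin_records(const TransactionId *observed, size_t n)
{
	TransactionId logged = InvalidTransactionId;
	size_t		records = 0;

	assert(observed != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (catalog_xmin_needs_update(logged, observed[i]))
		{
			logged = observed[i];
			records++;
		}
	}
	return records;
}
```

Feeding it the sequence 0, 0, 555, 555, 555, 0 yields two records (one for 555, one for the reset to 0), however many checks happen in between.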

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#78)
Re: Logical decoding on standby

On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:

> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> * Split oldestCatalogXmin tracking into separate patch
>
> Regarding this, Simon raised concerns about xlog volume here.
>
> It's pretty negligible.
>
> We only write a new record when a vacuum runs after catalog_xmin
> advances on the slot with the currently-lowest catalog_xmin (or, if
> vacuum doesn't run reasonably soon, when the bgworker next looks).

I'd prefer to slow things down a little, not be so eager.

If we hold back update of the catalog_xmin until when we run
GetRunningTransactionData() we wouldn't need to produce any WAL
records at all AND we wouldn't need to have VACUUM do
UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
task.

That would also make this patch about half the length it is.

Let me know what you think.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#80Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#79)
Re: Logical decoding on standby

On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> * Split oldestCatalogXmin tracking into separate patch
>>
>> Regarding this, Simon raised concerns about xlog volume here.
>>
>> It's pretty negligible.
>>
>> We only write a new record when a vacuum runs after catalog_xmin
>> advances on the slot with the currently-lowest catalog_xmin (or, if
>> vacuum doesn't run reasonably soon, when the bgworker next looks).
>
> I'd prefer to slow things down a little, not be so eager.
>
> If we hold back update of the catalog_xmin until when we run
> GetRunningTransactionData() we wouldn't need to produce any WAL
> records at all AND we wouldn't need to have VACUUM do
> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
> task.
>
> That would also make this patch about half the length it is.
>
> Let me know what you think.

Good idea.

We can always add a heuristic later to make xl_running_xacts get
emitted more often at high transaction rates if it's necessary.

Patch coming soon.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#81Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#80)
1 attachment(s)
Re: Logical decoding on standby

On 30 March 2017 at 11:34, Craig Ringer <craig@2ndquadrant.com> wrote:

> On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>
>> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>
>>>> * Split oldestCatalogXmin tracking into separate patch
>>>
>>> Regarding this, Simon raised concerns about xlog volume here.
>>>
>>> It's pretty negligible.
>>>
>>> We only write a new record when a vacuum runs after catalog_xmin
>>> advances on the slot with the currently-lowest catalog_xmin (or, if
>>> vacuum doesn't run reasonably soon, when the bgworker next looks).
>>
>> I'd prefer to slow things down a little, not be so eager.
>>
>> If we hold back update of the catalog_xmin until when we run
>> GetRunningTransactionData() we wouldn't need to produce any WAL
>> records at all AND we wouldn't need to have VACUUM do
>> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
>> task.
>>
>> That would also make this patch about half the length it is.
>>
>> Let me know what you think.
>
> Good idea.
>
> We can always add a heuristic later to make xl_running_xacts get
> emitted more often at high transaction rates if it's necessary.
>
> Patch coming soon.

Attached.

A bit fiddlier than expected, but I like the result more.

In the process I identified an issue with both the prior patch and
this one where we don't check slot validity for slots that existed on
standby prior to promotion of standby to master. We were just assuming
that being the master was good enough, since it controls
replication_slot_catalog_xmin, but that's not true for pre-existing
slots.

Fixed by forcing update of the persistent safe catalog xmin after the
first slot is created on the master - which is now done by doing an
immediate LogStandbySnapshot() after assigning the slot's
catalog_xmin.
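
The validity rule described above can be illustrated with a small standalone check. This is a sketch, not the patch code: `slot_xid_follows` and `logical_slot_is_usable` are hypothetical names, and the comparison is assumed to mirror PostgreSQL's wraparound-aware TransactionIdFollows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;		/* stand-in for PostgreSQL's type */

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Wraparound-aware "a is newer than b"; special xids compare plainly. */
bool
slot_xid_follows(TransactionId a, TransactionId b)
{
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/*
 * A slot is usable only if a persistent threshold exists and it has not
 * moved past the slot's own catalog_xmin.  An unset threshold means we
 * have no record that the catalog rows were ever protected, which is the
 * case for a slot that pre-dates promotion.
 */
bool
logical_slot_is_usable(TransactionId oldest_catalog_xmin,
					   TransactionId slot_catalog_xmin)
{
	if (oldest_catalog_xmin == InvalidTransactionId)
		return false;
	return !slot_xid_follows(oldest_catalog_xmin, slot_catalog_xmin);
}
```

This also shows why the first slot created on the master needs the threshold recorded immediately: until that happens the first branch rejects every slot, including a perfectly good new one.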

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

log-catalog-xmin-advances-v3.patch (text/x-patch; charset=US-ASCII)
From 0df4f4ae04f8d37c623d3a533699966c3cc0479a Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH v2] Log catalog_xmin advances before removing catalog tuples

Before advancing the effective catalog_xmin we use to remove old catalog
tuple versions, make sure it is written to WAL. This allows standbys
to know the oldest xid they can safely create a historic snapshot for.
They can then refuse to start decoding from a slot or raise a recovery
conflict.

The catalog_xmin advance is logged in the next xl_running_xacts record, so
vacuum of catalogs may be held back up to 10 seconds when a replication slot
with catalog_xmin is holding down the global catalog_xmin.
---
 src/backend/access/heap/rewriteheap.c       |  3 +-
 src/backend/access/rmgrdesc/standbydesc.c   |  5 ++-
 src/backend/access/transam/varsup.c         |  1 -
 src/backend/access/transam/xlog.c           | 26 ++++++++++-
 src/backend/replication/logical/logical.c   | 54 +++++++++++++++++++++++
 src/backend/replication/walreceiver.c       |  2 +-
 src/backend/replication/walsender.c         | 13 ++++++
 src/backend/storage/ipc/procarray.c         | 68 +++++++++++++++++++++++------
 src/backend/storage/ipc/standby.c           | 25 +++++++++++
 src/bin/pg_controldata/pg_controldata.c     |  2 +
 src/include/access/transam.h                | 11 +++++
 src/include/catalog/pg_control.h            |  1 +
 src/include/storage/procarray.h             |  3 +-
 src/include/storage/standby.h               |  6 +++
 src/include/storage/standbydefs.h           |  1 +
 src/test/recovery/t/006_logical_decoding.pl | 15 ++++++-
 16 files changed, 214 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 278546a..4aaae59 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -21,10 +21,11 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
 	int			i;
 
-	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
+	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u oldestCatalogXmin %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
-					 xlrec->oldestRunningXid);
+					 xlrec->oldestRunningXid,
+					 xlrec->oldestCatalogXmin);
 	if (xlrec->xcnt > 0)
 	{
 		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..4babdf9 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,7 +414,6 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
-
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
  *
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..19e0116 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6632,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8704,6 +8709,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
 	LWLockAcquire(CommitTsLock, LW_SHARED);
 	checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
 	checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
@@ -9633,6 +9642,12 @@ xlog_redo(XLogReaderState *record)
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 
 		/*
+		 * There can be no concurrent writers to oldestCatalogXmin during
+		 * recovery, so no need to take ProcArrayLock.
+		 */
+		ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
+
+		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
 		 * record, the backup was canceled and the end-of-backup record will
 		 * never arrive.
@@ -9729,8 +9744,15 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+
+
+		/*
+		 * There can be no concurrent writers to oldestCatalogXmin during
+		 * recovery, so no need to take ProcArrayLock.
+		 */
+		ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..76155bf 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -68,6 +68,8 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -218,6 +220,7 @@ CreateInitDecodingContext(char *plugin,
 	ReplicationSlot *slot;
 	LogicalDecodingContext *ctx;
 	MemoryContext old_context;
+	bool force_standby_snapshot;
 
 	/* shorter lines... */
 	slot = MyReplicationSlot;
@@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	/*
+	 * If this is the first slot created on the master we won't have a
+	 * persistent record of the oldest safe xid for historic snapshots yet.
+	 * Force one to be recorded so that when we go to replay from this slot we
+	 * know it's safe.
+	 */
+	force_standby_snapshot =
+		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
+
 	LWLockRelease(ProcArrayLock);
 
+	/* Update ShmemVariableCache->oldestCatalogXmin */
+	if (force_standby_snapshot)
+		LogStandbySnapshot();
+
 	/*
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
@@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	EnsureActiveLogicalSlotValid();
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId,
 								 read_page, prepare_write, do_write);
@@ -963,3 +981,39 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	TransactionId shmem_catalog_xmin;
+
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * A logical slot can become unusable if we're doing logical decoding on a
+	 * standby or using a slot created before we were promoted from standby
+	 * to master. If the master advanced its global catalog_xmin past the
+	 * threshold we need it could've removed catalog tuple versions that
+	 * we'll require to start decoding at our restart_lsn.
+	 *
+	 * We need a barrier so that if we decode in recovery on a standby we
+	 * don't allow new decoding sessions to start after redo has advanced
+	 * the threshold.
+	 */
+	if (RecoveryInProgress())
+		pg_memory_barrier();
+
+	shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+		TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("replication slot '%s' requires catalogs removed by master",
+						NameStr(MyReplicationSlot->data.name)),
+				 errdetail("need catalog_xmin %u, have oldestCatalogXmin %u",
+						   MyReplicationSlot->data.catalog_xmin, shmem_catalog_xmin)));
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 771ac30..c2ad791 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1233,7 +1233,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..cdc5f95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1778,6 +1783,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..a5b26dd 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,11 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -679,6 +683,18 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
 
 	/*
+	 * Update our knowledge of the oldest xid we can safely create historic
+	 * snapshots for.
+	 *
+	 * There can be no concurrent writers to oldestCatalogXmin during
+	 * recovery, so no need to take ProcArrayLock.
+	 *
+	 * If we allow logical decoding on standbys in future we must raise
+	 * recovery conflicts with catalog_xmin advances here.
+	 */
+	ShmemVariableCache->oldestCatalogXmin = running->pendingOldestCatalogXmin;
+
+	/*
 	 * Remove stale locks, if any.
 	 *
 	 * Locks are always assigned to the toplevel xid so we don't need to care
@@ -1306,6 +1322,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1493,7 +1512,8 @@ GetMaxSnapshotSubxidCount(void)
  *			older than this are known not running any more.
  *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
  *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by GetOldestXmin(true, true).
+ *			the same computation done by
+ *			GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT|PROCARRAY_FLAGS_VACUUM)
  *		RecentGlobalDataXmin: the global xmin for non-catalog tables
  *			>= RecentGlobalXmin
  *
@@ -1700,7 +1720,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1731,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2064,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
 	 * of a circular dependency where slots only increase their limits when
 	 * running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can include the catalog_xmin limit here; there's no similar
+	 * circularity, and we need it to log xl_running_xacts records for
+	 * standbys.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2055,6 +2082,8 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->nextXid = ShmemVariableCache->nextXid;
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
+	CurrentRunningXacts->pendingOldestCatalogXmin =
+		procArray->replication_slot_catalog_xmin;
 
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
@@ -2168,14 +2197,14 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by an existing
+	 * replication slot it's definitely safe to start there, and it can't
+	 * advance while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2965,18 +2994,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..819abf7 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -45,6 +45,7 @@ static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlis
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
+static void UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin);
 
 
 /*
@@ -822,6 +823,7 @@ standby_redo(XLogReaderState *record)
 		running.latestCompletedXid = xlrec->latestCompletedXid;
 		running.oldestRunningXid = xlrec->oldestRunningXid;
 		running.xids = xlrec->xids;
+		running.pendingOldestCatalogXmin = xlrec->oldestCatalogXmin;
 
 		ProcArrayApplyRecoveryInfo(&running);
 	}
@@ -953,12 +955,24 @@ LogStandbySnapshot(void)
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
+	/*
+	 * Now that we've recorded our intention to allow cleanup of catalog tuples
+	 * no longer needed by our replication slots we can make the new threshold
+	 * effective for vacuum etc.
+	 */
+	UpdateOldestCatalogXmin(running->pendingOldestCatalogXmin);
+
 	return recptr;
 }
 
 /*
  * Record an enhanced snapshot of running transactions into WAL.
  *
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here, so standbys know we're about
+ * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
+ * removing dead catalog tuples below that threshold.
+ *
  * The definitions of RunningTransactionsData and xl_xact_running_xacts are
  * similar. We keep them separate because xl_xact_running_xacts is a
  * contiguous chunk of memory and never exists fully until it is assembled in
@@ -977,6 +991,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
+	xlrec.oldestCatalogXmin = CurrRunningXacts->pendingOldestCatalogXmin;
 
 	/* Header */
 	XLogBeginInsert();
@@ -1021,6 +1036,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	return recptr;
 }
 
+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
+		|| (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
+		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
+
 /*
  * Wholesale logging of AccessExclusiveLocks. Other lock types need not be
  * logged, as described in backend/storage/lmgr/README.
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..a4ecfb7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -136,6 +136,17 @@ typedef struct VariableCacheData
 										 * aborted */
 
 	/*
+	 * This field is protected by ProcArrayLock except
+	 * during recovery, when it's set unlocked.
+	 *
+	 * oldestCatalogXmin is the oldest xid it is
+	 * guaranteed to be safe to create a historic
+	 * snapshot for. See also
+	 * procArray->replication_slot_catalog_xmin
+	 */
+	TransactionId oldestCatalogXmin;
+
+	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
 	TransactionId oldestClogXid;	/* oldest it's safe to look up in clog */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..05ace64 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,7 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..7756a27 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -65,6 +65,10 @@ extern void StandbyReleaseOldLocks(int nxids, TransactionId *xids);
  * is written to WAL as a separate record immediately after each
  * checkpoint. That means that wherever we start a standby from we will
  * almost immediately see the data we need to begin executing queries.
+ *
+ * Information about the oldest catalog_xmin needed by any replication slot is
+ * also included here, so we can use it to update the catalog tuple removal
+ * limit and convey the new limit to standbys.
  */
 
 typedef struct RunningTransactionsData
@@ -75,6 +79,8 @@ typedef struct RunningTransactionsData
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	/* so we can update ShmemVariableCache->oldestCatalogXmin: */
+	TransactionId pendingOldestCatalogXmin;
 
 	TransactionId *xids;		/* array of (sub)xids still running */
 } RunningTransactionsData;
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index f8444c7..6153675 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -52,6 +52,7 @@ typedef struct xl_running_xacts
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	TransactionId oldestCatalogXmin;	/* oldest safe historic snapshot */
 
 	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..2cfa9ac 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 25;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -17,6 +17,10 @@ $node_master->append_conf(
 wal_level = logical
 ));
 $node_master->start;
+
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after start");
+
 my $backup_name = 'master_backup';
 
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -96,9 +100,18 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+	"pg_controldata's oldestCatalogXmin is nonzero");
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+# First checkpoint forces xl_running_xacts with the new oldestCatalogXmin
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+# Then we need a second checkpoint to write the control file with the new value
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint");
 
 # done with the node
 $node_master->stop;
-- 
2.5.5

#82Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Craig Ringer (#81)
1 attachment(s)
Re: Logical decoding on standby

On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:

Attached.

* Cleaned up in 3 places
* Added code for faked up RunningTransactions in xlog.c
* Ensure catalog_xmin doesn't go backwards
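
For anyone reading along without the tree handy, the "doesn't go backwards" rule in the last item can be sketched as a small standalone C fragment. This is only an illustration modelled on UpdateOldestCatalogXmin() in the attached patch, not the patch code itself: the helper names (xid_is_valid, xid_is_normal, xid_precedes, advance_catalog_xmin) are hypothetical, and the typedef and constants are local stand-ins for what transam.h provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the definitions in access/transam.h. */
typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/*
 * Wraparound-aware "a is older than b": normal xids are compared
 * modulo 2^32, special xids compare as plain unsigned integers.
 */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: return the value the retained catalog_xmin should
 * hold after a new candidate arrives.  The threshold moves forward, or
 * flips between valid and invalid (first slot created, last slot
 * dropped), but never moves backwards to an older valid xid.
 */
static TransactionId
advance_catalog_xmin(TransactionId current, TransactionId pending)
{
	if (xid_precedes(current, pending) ||
		xid_is_valid(current) != xid_is_valid(pending))
		return pending;
	return current;
}
```

With these definitions, advance_catalog_xmin(100, 90) keeps 100, while advance_catalog_xmin(100, 200) moves to 200 and advance_catalog_xmin(100, InvalidTransactionId) resets to invalid once no slot needs catalogs any more.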

All else looks good. Comments before commit?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

log-catalog-xmin-advances-v4.patch (application/octet-stream)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 278546a..4aaae59 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -21,10 +21,11 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
 	int			i;
 
-	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
+	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u oldestCatalogXmin %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
-					 xlrec->oldestRunningXid);
+					 xlrec->oldestRunningXid,
+					 xlrec->oldestCatalogXmin);
 	if (xlrec->xcnt > 0)
 	{
 		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..6094465 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6632,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -6913,6 +6918,7 @@ StartupXLOG(void)
 				Assert(TransactionIdIsNormal(latestCompletedXid));
 				running.latestCompletedXid = latestCompletedXid;
 				running.xids = xids;
+				running.pendingOldestCatalogXmin = InvalidTransactionId;
 
 				ProcArrayApplyRecoveryInfo(&running);
 
@@ -8704,6 +8710,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
 	LWLockAcquire(CommitTsLock, LW_SHARED);
 	checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
 	checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
@@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 
 		/*
+		 * There can be no concurrent writers to oldestCatalogXmin during
+		 * recovery, so no need to take ProcArrayLock.
+		 */
+		ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
+
+		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
 		 * record, the backup was canceled and the end-of-backup record will
 		 * never arrive.
@@ -9675,6 +9691,7 @@ xlog_redo(XLogReaderState *record)
 			Assert(TransactionIdIsNormal(latestCompletedXid));
 			running.latestCompletedXid = latestCompletedXid;
 			running.xids = xids;
+			running.pendingOldestCatalogXmin = InvalidTransactionId;
 
 			ProcArrayApplyRecoveryInfo(&running);
 
@@ -9731,6 +9748,15 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * There can be no concurrent writers to oldestCatalogXmin during
+		 * recovery, so no need to take ProcArrayLock.
+		 */
+		if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
+									checkPoint.oldestCatalogXmin))
+			ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..76155bf 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -68,6 +68,8 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -218,6 +220,7 @@ CreateInitDecodingContext(char *plugin,
 	ReplicationSlot *slot;
 	LogicalDecodingContext *ctx;
 	MemoryContext old_context;
+	bool force_standby_snapshot;
 
 	/* shorter lines... */
 	slot = MyReplicationSlot;
@@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	/*
+	 * If this is the first slot created on the master we won't have a
+	 * persistent record of the oldest safe xid for historic snapshots yet.
+	 * Force one to be recorded so that when we go to replay from this slot we
+	 * know it's safe.
+	 */
+	force_standby_snapshot =
+		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
+
 	LWLockRelease(ProcArrayLock);
 
+	/* Update ShmemVariableCache->oldestCatalogXmin */
+	if (force_standby_snapshot)
+		LogStandbySnapshot();
+
 	/*
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
@@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	EnsureActiveLogicalSlotValid();
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId,
 								 read_page, prepare_write, do_write);
@@ -963,3 +981,39 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	TransactionId shmem_catalog_xmin;
+
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * A logical slot can become unusable if we're doing logical decoding on a
+	 * standby or using a slot created before we were promoted from standby
+	 * to master. If the master advanced its global catalog_xmin past the
+	 * threshold we need it could've removed catalog tuple versions that
+	 * we'll require to start decoding at our restart_lsn.
+	 *
+	 * We need a barrier so that if we decode in recovery on a standby we
+	 * don't allow new decoding sessions to start after redo has advanced
+	 * the threshold.
+	 */
+	if (RecoveryInProgress())
+		pg_memory_barrier();
+
+	shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+		TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("replication slot '%s' requires catalogs removed by master",
+						NameStr(MyReplicationSlot->data.name)),
+				 errdetail("need catalog_xmin %u, have oldestCatalogXmin %u",
+						   MyReplicationSlot->data.catalog_xmin, shmem_catalog_xmin)));
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 771ac30..c2ad791 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1233,7 +1233,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..cdc5f95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	 * be energy wasted - the worst lost information can do here is give us
 	 * wrong information in a statistics view - we'll just potentially be more
 	 * conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there.
 	 */
 }
 
@@ -1778,6 +1783,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..48b18ec 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,11 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -679,6 +683,20 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
 
 	/*
+	 * Update our knowledge of the oldest xid we can safely create historic
+	 * snapshots for.
+	 *
+	 * There can be no concurrent writers to oldestCatalogXmin during
+	 * recovery, so no need to take ProcArrayLock.
+	 *
+	 * If we allow logical decoding on standbys in future we must raise
+	 * recovery conflicts with catalog_xmin advances here.
+	 */
+	if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
+									  running->oldestRunningXid))
+		ShmemVariableCache->oldestCatalogXmin = running->pendingOldestCatalogXmin;
+
+	/*
 	 * Remove stale locks, if any.
 	 *
 	 * Locks are always assigned to the toplevel xid so we don't need to care
@@ -1306,6 +1324,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1493,7 +1514,8 @@ GetMaxSnapshotSubxidCount(void)
  *			older than this are known not running any more.
  *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
  *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by GetOldestXmin(true, true).
+ *			the same computation done by
+ *			GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT|PROCARRAY_FLAGS_VACUUM)
  *		RecentGlobalDataXmin: the global xmin for non-catalog tables
  *			>= RecentGlobalXmin
  *
@@ -1700,7 +1722,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1733,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2066,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
 	 * of a circular dependency where slots only increase their limits when
 	 * running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can include the catalog_xmin limit here; there's no similar
+	 * circularity, and we need it to log xl_running_xacts records for
+	 * standbys.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2055,6 +2084,8 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->nextXid = ShmemVariableCache->nextXid;
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
+	CurrentRunningXacts->pendingOldestCatalogXmin =
+		procArray->replication_slot_catalog_xmin;
 
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
@@ -2168,14 +2199,14 @@ GetOldestSafeDecodingTransactionId(void)
 	oldestSafeXid = ShmemVariableCache->nextXid;
 
 	/*
-	 * If there's already a slot pegging the xmin horizon, we can start with
-	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * If there's already an effectiveCatalogXmin held down by an existing
+	 * replication slot it's definitely safe to start there, and it can't
+	 * advance while we hold ProcArrayLock.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
-		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
 							  oldestSafeXid))
-		oldestSafeXid = procArray->replication_slot_catalog_xmin;
+		oldestSafeXid = ShmemVariableCache->oldestCatalogXmin;
 
 	/*
 	 * If we're not in recovery, we walk over the procarray and collect the
@@ -2965,18 +2996,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..819abf7 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -45,6 +45,7 @@ static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlis
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
+static void UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin);
 
 
 /*
@@ -822,6 +823,7 @@ standby_redo(XLogReaderState *record)
 		running.latestCompletedXid = xlrec->latestCompletedXid;
 		running.oldestRunningXid = xlrec->oldestRunningXid;
 		running.xids = xlrec->xids;
+		running.pendingOldestCatalogXmin = xlrec->oldestCatalogXmin;
 
 		ProcArrayApplyRecoveryInfo(&running);
 	}
@@ -953,12 +955,24 @@ LogStandbySnapshot(void)
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
+	/*
+	 * Now that we've recorded our intention to allow cleanup of catalog tuples
+	 * no longer needed by our replication slots we can make the new threshold
+	 * effective for vacuum etc.
+	 */
+	UpdateOldestCatalogXmin(running->pendingOldestCatalogXmin);
+
 	return recptr;
 }
 
 /*
  * Record an enhanced snapshot of running transactions into WAL.
  *
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here, so standbys know we're about
+ * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
+ * removing dead catalog tuples below that threshold.
+ *
  * The definitions of RunningTransactionsData and xl_xact_running_xacts are
  * similar. We keep them separate because xl_xact_running_xacts is a
  * contiguous chunk of memory and never exists fully until it is assembled in
@@ -977,6 +991,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
+	xlrec.oldestCatalogXmin = CurrRunningXacts->pendingOldestCatalogXmin;
 
 	/* Header */
 	XLogBeginInsert();
@@ -1021,6 +1036,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	return recptr;
 }
 
+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
+		|| (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
+		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
+
 /*
  * Wholesale logging of AccessExclusiveLocks. Other lock types need not be
  * logged, as described in backend/storage/lmgr/README.
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..a4ecfb7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -136,6 +136,17 @@ typedef struct VariableCacheData
 										 * aborted */
 
 	/*
+	 * This field is protected by ProcArrayLock except
+	 * during recovery, when it's set unlocked.
+	 *
+	 * oldestCatalogXmin is the oldest xid it is
+	 * guaranteed to be safe to create a historic
+	 * snapshot for. See also
+	 * procArray->replication_slot_catalog_xmin
+	 */
+	TransactionId oldestCatalogXmin;
+
+	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
 	TransactionId oldestClogXid;	/* oldest it's safe to look up in clog */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index c09c0f8..0621845 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
 /*
  * Each page of XLOG file has a header like this:
  */
-#define XLOG_PAGE_MAGIC 0xD097	/* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD100	/* can be used as WAL version indicator */
 
 typedef struct XLogPageHeaderData
 {
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..b9461b3 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -23,7 +23,7 @@
 #define MOCK_AUTH_NONCE_LEN		32
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1002
+#define PG_CONTROL_VERSION	1003
 
 /*
  * Body of CheckPoint XLOG records.  This is declared here because we keep
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..05ace64 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,7 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..7756a27 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -65,6 +65,10 @@ extern void StandbyReleaseOldLocks(int nxids, TransactionId *xids);
  * is written to WAL as a separate record immediately after each
  * checkpoint. That means that wherever we start a standby from we will
  * almost immediately see the data we need to begin executing queries.
+ *
+ * Information about the oldest catalog_xmin needed by any replication slot is
+ * also included here, so we can use it to update the catalog tuple removal
+ * limit and convey the new limit to standbys.
  */
 
 typedef struct RunningTransactionsData
@@ -75,6 +79,8 @@ typedef struct RunningTransactionsData
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	/* so we can update ShmemVariableCache->oldestCatalogXmin: */
+	TransactionId pendingOldestCatalogXmin;
 
 	TransactionId *xids;		/* array of (sub)xids still running */
 } RunningTransactionsData;
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index f8444c7..6153675 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -52,6 +52,7 @@ typedef struct xl_running_xacts
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	TransactionId oldestCatalogXmin;	/* oldest safe historic snapshot */
 
 	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..2cfa9ac 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 25;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -17,6 +17,10 @@ $node_master->append_conf(
 wal_level = logical
 ));
 $node_master->start;
+
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after start");
+
 my $backup_name = 'master_backup';
 
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -96,9 +100,18 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+	"pg_controldata's oldestCatalogXmin is nonzero");
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+# First checkpoint forces xl_running_xacts with the new oldestCatalogXmin
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+# Then we need a second checkpoint to write the control file with the new value
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+command_like(['pg_controldata', $node_master->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint");
 
 # done with the node
 $node_master->stop;
#83Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#82)
Re: Logical decoding on standby

On 2017-03-30 15:26:02 +0100, Simon Riggs wrote:

On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:

Attached.

* Cleaned up in 3 places
* Added code for faked up RunningTransactions in xlog.c
* Ensure catalog_xmin doesn't go backwards

All else looks good. Comments before commit?

Can you give me till after lunch?

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Andres Freund (#83)
Re: Logical decoding on standby

On 30 March 2017 at 15:27, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-30 15:26:02 +0100, Simon Riggs wrote:

On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:

Attached.

* Cleaned up in 3 places
* Added code for faked up RunningTransactions in xlog.c
* Ensure catalog_xmin doesn't go backwards

All else looks good. Comments before commit?

Can you give me till after lunch?

Sure, np

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#85Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#82)
Re: Logical decoding on standby

@@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);

/*
+		 * There can be no concurrent writers to oldestCatalogXmin during
+		 * recovery, so no need to take ProcArrayLock.
+		 */
+		ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

s/writers/writes/?

@@ -9731,6 +9748,15 @@ xlog_redo(XLogReaderState *record)
checkPoint.oldestXid))
SetTransactionIdLimit(checkPoint.oldestXid,
checkPoint.oldestXidDB);
+
+		/*
+		 * There can be no concurrent writers to oldestCatalogXmin during
+		 * recovery, so no need to take ProcArrayLock.
+		 */
+		if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
+									checkPoint.oldestCatalogXmin)
+			ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

ditto.

@@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,

ReplicationSlotsComputeRequiredXmin(true);

+	/*
+	 * If this is the first slot created on the master we won't have a
+	 * persistent record of the oldest safe xid for historic snapshots yet.
+	 * Force one to be recorded so that when we go to replay from this slot we
+	 * know it's safe.
+	 */
+	force_standby_snapshot =
+		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);

s/first slot/first logical slot/. Also, the reference to master doesn't
seem right.

LWLockRelease(ProcArrayLock);

+	/* Update ShmemVariableCache->oldestCatalogXmin */
+	if (force_standby_snapshot)
+		LogStandbySnapshot();

The comment and code don't quite square to me - it's far from obvious
that LogStandbySnapshot does something like that. I'd even say it's a
bad idea to have it do that.

/*
* tell the snapshot builder to only assemble snapshot once reaching the
* running_xact's record with the respective xmin.
@@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
start_lsn = slot->data.confirmed_flush;
}

+ EnsureActiveLogicalSlotValid();

This seems like it should be in a separate patch, and separately
reviewed. It's code that's currently unreachable (and thus untestable).

+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	TransactionId shmem_catalog_xmin;
+
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * A logical slot can become unusable if we're doing logical decoding on a
+	 * standby or using a slot created before we were promoted from standby
+	 * to master.

Neither of those is currently possible...

If the master advanced its global catalog_xmin past the
+	 * threshold we need it could've removed catalog tuple versions that
+	 * we'll require to start decoding at our restart_lsn.
+	 *
+	 * We need a barrier so that if we decode in recovery on a standby we
+	 * don't allow new decoding sessions to start after redo has advanced
+	 * the threshold.
+	 */
+	if (RecoveryInProgress())
+		pg_memory_barrier();

I don't think this is a meaningful locking protocol. It's a bad idea to
use lock-free programming without need, especially when the concurrency
protocol isn't well defined. What other barrier does this pair
with? What prevents the data being out of date by the time we actually
do the check?

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..cdc5f95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
* be energy wasted - the worst lost information can do here is give us
* wrong information in a statistics view - we'll just potentially be more
* conservative in removing files.
+	 *
+	 * We don't have to do any effective_xmin / effective_catalog_xmin testing
+	 * here either, like for LogicalConfirmReceivedLocation. If we received
+	 * the xmin and catalog_xmin from downstream replication slots we know they
+	 * were already confirmed there,
*/
}

This comment reads as if LogicalConfirmReceivedLocation had
justification for not touching / checking catalog_xmin. But it does.

/*
+	 * Update our knowledge of the oldest xid we can safely create historic
+	 * snapshots for.
+	 *
+	 * There can be no concurrent writers to oldestCatalogXmin during
+	 * recovery, so no need to take ProcArrayLock.

By now I think this is pretty flawed logic, because there can be
concurrent readers that need to be safe against oldestCatalogXmin
advancing concurrently.

/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
* snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
* were to be included here the initial value could never increase because
* of a circular dependency where slots only increase their limits when
* running xacts increases oldestRunningXid and running xacts only
* increases if slots do.
+	 *
+	 * We can include the catalog_xmin limit here; there's no similar
+	 * circularity, and we need it to log xl_running_xacts records for
+	 * standbys.
*/

Those comments seem to need some more heavy-handed reconciliation.

*
* Return the current slot xmin limits. That's useful to be able to remove
* data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective

This seems to need some light editing.

/*
* Record an enhanced snapshot of running transactions into WAL.
*
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here, so standbys know we're about
+ * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
+ * removing dead catalog tuples below that threshold.

I think this needs some rephrasing. We're not necessarily about to remove
catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
+		|| (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
+		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}

Doing TransactionIdPrecedes before ensuring
ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
me as a good idea. Generally, the expression as it stands is hard to
understand.
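
One way to read this objection is that the validity checks should come first and the ordering comparison second. The following is only a sketch of that restructuring, not code from the patch: CatalogXminNeedsUpdate is a hypothetical helper name, and the typedef, macros and TransactionIdPrecedes below are simplified stand-ins for the real definitions in access/transam.h and transam.c. It is intended to return true in exactly the cases where the quoted condition is true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's definitions (assumptions). */
typedef uint32_t TransactionId;
#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)
#define TransactionIdIsValid(xid)	((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid)	((xid) >= FirstNormalTransactionId)

/* Modulo-2^32 ordering for normal xids, plain ordering for special ones. */
static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
	if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
		return id1 < id2;
	return (int32_t) (id1 - id2) < 0;
}

/*
 * Hypothetical helper: should the stored threshold be replaced by the
 * pending one?
 */
static bool
CatalogXminNeedsUpdate(TransactionId current, TransactionId pending)
{
	/* Either side unset: update only when the validity actually changes. */
	if (!TransactionIdIsValid(current) || !TransactionIdIsValid(pending))
		return TransactionIdIsValid(current) != TransactionIdIsValid(pending);

	/* Both set: only ever move the threshold forwards. */
	return TransactionIdPrecedes(current, pending);
}
```

With a helper like this, the body of UpdateOldestCatalogXmin would reduce to a single test under ProcArrayLock, and the three cases (set, clear, advance) are spelled out separately.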

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..a4ecfb7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -136,6 +136,17 @@ typedef struct VariableCacheData
* aborted */
/*
+	 * This field is protected by ProcArrayLock except
+	 * during recovery, when it's set unlocked.
+	 *
+	 * oldestCatalogXmin is the oldest xid it is
+	 * guaranteed to be safe to create a historic
+	 * snapshot for. See also
+	 * procArray->replication_slot_catalog_xmin
+	 */
+	TransactionId oldestCatalogXmin;

Maybe it'd be better to rephrase that to something like
"oldestCatalogXmin guarantees that no valid catalog tuples >= it
are removed. That property is used for logical decoding.", or similar?

/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD097	/* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD100	/* can be used as WAL version indicator */

We normally only advance this by one, it's not tied to the poistgres version.

I'm sorry, but to me this patch isn't ready. I'm also doubtful that it
makes a whole lot of sense on its own, without having finished the
design for decoding on a standby - it seems quite likely that we'll have
to redesign the mechanisms in here a bit for that. For 10 this seems to
do nothing but add overhead?

Greetings,

Andres Freund


#86Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#75)
Re: Logical decoding on standby

On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:

On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:

--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
<para>
Drops a replication slot, freeing any reserved server-side resources. If
the slot is currently in use by an active connection, this command fails.
+      If the slot is a logical slot that was created in a database other than
+      the database the walsender is connected to, this command fails.
</para>
<variablelist>
<varlistentry>

Shouldn't the docs in the drop database section mention this?

DROP DATABASE doesn't really discuss all the resources it drops, but
I'm happy to add mention of replication slots handling.

I don't think that's really comparable, because the other things aren't
global objects, which replication slots are.

- Andres


#87Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Andres Freund (#85)
Re: Logical decoding on standby

On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote:

/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD097       /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD100       /* can be used as WAL version indicator */

We normally only advance this by one, it's not tied to the poistgres version.

That was my addition. I rounded it up cos this is release 10. No biggie.

(Poistgres? Is that the Manhattan spelling?)

I'm sorry, but to me this patch isn't ready. I'm also doubtful that it
makes a whole lot of sense on its own, without having finished the
design for decoding on a standby - it seems quite likely that we'll have
to redesign the mechanisms in here a bit for that. For 10 this seems to
do nothing but add overhead?

I'm sure we can reword the comments.

We've been redesigning the mechanisms for 2 years now, so it seems
unlikely that further redesign can be required. If it is required,
this patch is fairly low touch and change is possible later,
incremental development etc. As regards overhead, this adds a small
amount of time to a background process executed every 10 secs,
and generates no new WAL records.

So I don't see any reason not to commit this feature, after the minor
corrections.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#88Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#87)
Re: Logical decoding on standby

On 2017-03-30 19:40:08 +0100, Simon Riggs wrote:

On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote:

/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD097       /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD100       /* can be used as WAL version indicator */

We normally only advance this by one, it's not tied to the poistgres version.

That was my addition. I rounded it up cos this is release 10. No biggie.

We'll probably upgrade that more than once again this release...

(Poistgres? Is that the Manhattan spelling?)

Tiredness spelling ;)

We've been redesigning the mechanisms for 2 years now, so it seems
unlikely that further redesign can be required.

I don't think that's true *at all* - the mechanism was previously
fundamentally different.

The whole topic has largely seen activity shortly before the code
freeze, both last time round and now. I don't think it's surprising
that it thus doesn't end up being ready.

If it is required,
this patch is fairly low touch and change is possible later,
incremental development etc. As regards overhead, this adds a small
amount of time to a background process executed every 10 secs,
and generates no new WAL records.

So I don't see any reason not to commit this feature, after the minor
corrections.

It doesn't have any benefit on its own, the locking model doesn't seem
fully there. I don't see much reason to get this in before the release.

- Andres


#89Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#86)
Re: Logical decoding on standby

On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:

On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:

--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
<para>
Drops a replication slot, freeing any reserved server-side resources. If
the slot is currently in use by an active connection, this command fails.
+      If the slot is a logical slot that was created in a database other than
+      the database the walsender is connected to, this command fails.
</para>
<variablelist>
<varlistentry>

Shouldn't the docs in the drop database section mention this?

DROP DATABASE doesn't really discuss all the resources it drops, but
I'm happy to add mention of replication slots handling.

I don't think that's really comparable, because the other things aren't
global objects, which replication slots are.

Fine by me.

Patch fix-slot-drop-docs.patch, upthread, adds the passage

+
+  <para>
+   Active <link linkend="logicaldecoding-replication-slots">logical
+   replication slots</> count as connections and will prevent a
+   database from being dropped. Inactive slots will be automatically
+   dropped when the database is dropped.
+  </para>

to the notes section of the DROP DATABASE docs; that should do the
trick. I'm not convinced it's worth going into the exceedingly
unlikely race with concurrent slot drop, and we don't seem to in other
places in the docs, like the race you mentioned with connecting to a
db as it's being dropped.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#90Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#85)
Re: Logical decoding on standby

On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

@@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);

/*
+              * There can be no concurrent writers to oldestCatalogXmin during
+              * recovery, so no need to take ProcArrayLock.
+              */
+             ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

s/writers/writes/?

I meant writers, i.e. nothing else can be writing to it. But writes works too.

@@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,

ReplicationSlotsComputeRequiredXmin(true);

+     /*
+      * If this is the first slot created on the master we won't have a
+      * persistent record of the oldest safe xid for historic snapshots yet.
+      * Force one to be recorded so that when we go to replay from this slot we
+      * know it's safe.
+      */
+     force_standby_snapshot =
+             !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);

s/first slot/first logical slot/. Also, the reference to master doesn't
seem right.

Unsure what you mean re reference to master not seeming right.

If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
from the new slot so we need to make sure it gets advanced once we've
decided on our starting catalog_xmin.

LWLockRelease(ProcArrayLock);

+     /* Update ShmemVariableCache->oldestCatalogXmin */
+     if (force_standby_snapshot)
+             LogStandbySnapshot();

The comment and code don't quite square to me - it's far from obvious
that LogStandbySnapshot does something like that. I'd even say it's a
bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

I don't have much preference; I liked the small code reduction of
Simon's proposed approach, but it landed up being a bit awkward in
terms of ordering and locking. I don't think catalog_xmin tracking is
really closely related to the standby snapshot stuff and it feels a
bit like it's a tacked-on afterthought where it is now.

/*
* tell the snapshot builder to only assemble snapshot once reaching the
* running_xact's record with the respective xmin.
@@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
start_lsn = slot->data.confirmed_flush;
}

+ EnsureActiveLogicalSlotValid();

This seems like it should be in a separate patch, and separately
reviewed. It's code that's currently unreachable (and thus untestable).

It is reached and is run, those checks run whenever creating a
non-initial decoding context on master or replica.

The failure case is reachable if a replica has hot_standby_feedback
off or it's not using a physical slot and loses its connection. If
promoted, any slot existing on that replica (from a file system level
copy when the replica was created) will fail. I agree it's contrived
since we can't create and maintain slots on replicas, which is the
main point of it.

+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+     TransactionId shmem_catalog_xmin;
+
+     Assert(MyReplicationSlot != NULL);
+
+     /*
+      * A logical slot can become unusable if we're doing logical decoding on a
+      * standby or using a slot created before we were promoted from standby
+      * to master.

Neither of those is currently possible...

Right. Because it's foundations for decoding on standby.

If the master advanced its global catalog_xmin past the
+      * threshold we need it could've removed catalog tuple versions that
+      * we'll require to start decoding at our restart_lsn.
+      *
+      * We need a barrier so that if we decode in recovery on a standby we
+      * don't allow new decoding sessions to start after redo has advanced
+      * the threshold.
+      */
+     if (RecoveryInProgress())
+             pg_memory_barrier();

I don't think this is a meaningful locking protocol. It's a bad idea to
use lock-free programming without need, especially when the concurrency
protocol isn't well defined.

Yeah. The intended interaction is with recovery conflict on standby
which doesn't look likely to land in this release due to patch
split/cleanup etc. (Not a complaint).

Better to just take a brief shared ProcArrayLock.

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..cdc5f95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
* be energy wasted - the worst lost information can do here is give us
* wrong information in a statistics view - we'll just potentially be more
* conservative in removing files.
+      *
+      * We don't have to do any effective_xmin / effective_catalog_xmin testing
+      * here either, like for LogicalConfirmReceivedLocation. If we received
+      * the xmin and catalog_xmin from downstream replication slots we know they
+      * were already confirmed there,
*/
}

This comment reads as if LogicalConfirmReceivedLocation had
justification for not touching / checking catalog_xmin. But it does.

It touches it; what it doesn't do is test and only advance if the new
value is greater, like for xmin as referenced in the prior paragraph.
Will clarify.

/*
+      * Update our knowledge of the oldest xid we can safely create historic
+      * snapshots for.
+      *
+      * There can be no concurrent writers to oldestCatalogXmin during
+      * recovery, so no need to take ProcArrayLock.

By now I think this is pretty flawed logic, because there can be
concurrent readers that need to be safe against oldestCatalogXmin
advancing concurrently.

You're right, we'll need a lock or suitable barriers here to ensure
that slot conflict with recovery and startup of new decoding sessions
doesn't see outdated values. This would be the peer of the
pg_memory_barrier() above. Or could just take a lock; there's enough
other locking activity in redo that it should be fine.

/*
-      * It's important *not* to include the limits set by slots here because
+      * It's important *not* to include the xmin set by slots here because
* snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
* were to be included here the initial value could never increase because
* of a circular dependency where slots only increase their limits when
* running xacts increases oldestRunningXid and running xacts only
* increases if slots do.
+      *
+      * We can include the catalog_xmin limit here; there's no similar
+      * circularity, and we need it to log xl_running_xacts records for
+      * standbys.
*/

Those comments seem to need some more heavyhanded reconciliation.

OK. To me it seems clear that the first refers to xmin, the second to
catalog_xmin. But after all I wrote it, and the important thing is
what it says to people who are not me. Will adjust.

*
* Return the current slot xmin limits. That's useful to be able to remove
* data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective

This seems to need some light editing.

catalog_xmin => catalog_xmins I guess.

/*
* Record an enhanced snapshot of running transactions into WAL.
*
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here, so standbys know we're about
+ * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
+ * removing dead catalog tuples below that threshold.

I think this needs some rephrasing. We're not necessarily about to remove
catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

Agreed

+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+     if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
+             || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
+             ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+     LWLockRelease(ProcArrayLock);
+}

Doing TransactionIdPrecedes before ensuring
ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
me as a good idea. Generally, the expression as it stands is hard to
understand.

OK.

I found other formulations to be long and hard to read. Expressing it
as "if validity has changed or value has increased" made more sense.
Agree order should change.

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..a4ecfb7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -136,6 +136,17 @@ typedef struct VariableCacheData
* aborted */
/*
+      * This field is protected by ProcArrayLock except
+      * during recovery, when it's set unlocked.
+      *
+      * oldestCatalogXmin is the oldest xid it is
+      * guaranteed to be safe to create a historic
+      * snapshot for. See also
+      * procArray->replication_slot_catalog_xmin
+      */
+     TransactionId oldestCatalogXmin;

Maybe it'd be better to rephrase that to something like
"oldestCatalogXmin guarantees that no valid catalog tuples >= it
are removed. That property is used for logical decoding." or similar?

Fine by me.

I'll adjust this per discussion and per a comment Simon made
separately. Whether we use it right away or not it's worth having it
updated while it's still freshly in mind.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#91Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#90)
Re: Logical decoding on standby

On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:

On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

The comment and code don't quite square to me - it's far from obvious
that LogStandbySnapshot does something like that. I'd even say it's a
bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

Alternately, we can record the creation timeline on slots, so we know
if there's been a promotion. It wouldn't make sense to do this if that
were the only use of timelines on slots. But I'm aware you'd rather
keep slots timeline-agnostic and I tend to agree.

Anyway, per your advice will split out the validation step.

(I'd like replication origins to be able to track time alongside lsn,
and to pass the timeline of each LSN to output plugin callbacks so we
can detect if a physical promotion results in us backtracking down a
fork in history, but that doesn't affect slots.)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#92Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#90)
1 attachment(s)
Re: Logical decoding on standby

On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:

On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

@@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);

/*
+              * There can be no concurrent writers to oldestCatalogXmin during
+              * recovery, so no need to take ProcArrayLock.
+              */
+             ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

s/writers/writes/?

I meant writers, i.e. nothing else can be writing to it. But writes works too.

Fixed.

@@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,

ReplicationSlotsComputeRequiredXmin(true);

+     /*
+      * If this is the first slot created on the master we won't have a
+      * persistent record of the oldest safe xid for historic snapshots yet.
+      * Force one to be recorded so that when we go to replay from this slot we
+      * know it's safe.
+      */
+     force_standby_snapshot =
+             !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);

s/first slot/first logical slot/. Also, the reference to master doesn't
seem right.

Unsure what you mean re reference to master not seeming right.

If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
from the new slot so we need to make sure it gets advanced once we've
decided on our starting catalog_xmin.

Moved to next patch, will address there.

LWLockRelease(ProcArrayLock);

+     /* Update ShmemVariableCache->oldestCatalogXmin */
+     if (force_standby_snapshot)
+             LogStandbySnapshot();

The comment and code don't quite square to me - it's far from obvious
that LogStandbySnapshot does something like that. I'd even say it's a
bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

I don't have much preference; I liked the small code reduction of
Simon's proposed approach, but it landed up being a bit awkward in
terms of ordering and locking. I don't think catalog_xmin tracking is
really closely related to the standby snapshot stuff and it feels a
bit like it's a tacked-on afterthought where it is now.

This code moved to next patch. But we do need to agree on the best approach.

If we're not going to force a standby snapshot here, then it's
probably better to use the separate xl_catalog_xmin design instead.

/*
* tell the snapshot builder to only assemble snapshot once reaching the
* running_xact's record with the respective xmin.
@@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
start_lsn = slot->data.confirmed_flush;
}

+ EnsureActiveLogicalSlotValid();

This seems like it should be in a separate patch, and separately
reviewed. It's code that's currently unreachable (and thus untestable).

It is reached and is run; those checks run whenever creating a
non-initial decoding context on master or replica.

Again, moved to next patch.

/*
+      * Update our knowledge of the oldest xid we can safely create historic
+      * snapshots for.
+      *
+      * There can be no concurrent writers to oldestCatalogXmin during
+      * recovery, so no need to take ProcArrayLock.

By now I think this is pretty flawed logic, because there can be concurrent
readers that need to be safe against oldestCatalogXmin advancing
concurrently.

You're right, we'll need a lock or suitable barriers here to ensure
that slot conflict with recovery and startup of new decoding sessions
doesn't see outdated values. This would be the peer of the
pg_memory_barrier() above. Or could just take a lock; there's enough
other locking activity in redo that it should be fine.

Now takes ProcArrayLock briefly.

oldestCatalogXmin is also used in GetOldestSafeDecodingTransactionId,
and there we want to prevent it from being advanced. But on further
thought, relying on oldestCatalogXmin there is actually unsafe; on the
master, we might've already logged our intent to advance it to some
greater value of procArray->replication_slot_catalog_xmin and will do
so as ProcArrayLock is released. On standby we're also better off
relying on procArray->replication_slot_catalog_xmin since that's what
we'll be sending in feedback.

Went back to using replication_slot_catalog_xmin here, with comment:

*
* We don't use ShmemVariableCache->oldestCatalogXmin here because another
* backend may have already logged its intention to advance it to a higher
* value (still <= replication_slot_catalog_xmin) and just be waiting on
* ProcArrayLock to actually apply the change. On a standby
* replication_slot_catalog_xmin is what the walreceiver will be sending in
* hot_standby_feedback, not oldestCatalogXmin.
*/

/*
-      * It's important *not* to include the limits set by slots here because
+      * It's important *not* to include the xmin set by slots here because
* snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
* were to be included here the initial value could never increase because
* of a circular dependency where slots only increase their limits when
* running xacts increases oldestRunningXid and running xacts only
* increases if slots do.
+      *
+      * We can include the catalog_xmin limit here; there's no similar
+      * circularity, and we need it to log xl_running_xacts records for
+      * standbys.
*/

Those comments seem to need some more heavyhanded reconciliation.

OK. To me it seems clear that the first refers to xmin, the second to
catalog_xmin. But after all I wrote it, and the important thing is
what it says to people who are not me. Will adjust.

Changed to

* We can safely report the catalog_xmin limit for replication slots here
* because it's only used to advance oldestCatalogXmin. Slots'
* catalog_xmin advance does not depend on it so there's no circularity.

*
* Return the current slot xmin limits. That's useful to be able to remove
* data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmin, we return both the effective

This seems to need some light editing.

catalog_xmin => catalog_xmins I guess.

Amended.

/*
* Record an enhanced snapshot of running transactions into WAL.
*
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here, so standbys know we're about
+ * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
+ * removing dead catalog tuples below that threshold.

I think this needs some rephrasing. We're not necessarily about to remove
catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

Agreed

* We also record the value of procArray->replication_slot_catalog_xmin
* obtained from GetRunningTransactionData here. We intend to advance
* ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
* of the new value, which will permit removal of previously-protected dead
* catalog tuples. The standby needs to know about that before any WAL
* records containing such tuple removals could possibly arrive.

+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+     if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
+             || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
+             ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+     LWLockRelease(ProcArrayLock);
+}

Doing TransactionIdPrecedes before ensuring
ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
me as a good idea. Generally, the expression as it stands is hard to
understand.

OK.

I found other formulations to be long and hard to read. Expressing it
as "if validity has changed or value has increased" made more sense.
Agree order should change.

Re-ordered, otherwise left the same.

Could add a comment like

"we must set oldestCatalogXmin if its validity has changed or it is advancing"

but seems rather redundant to the code.
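
For readers following along, a standalone sketch of the re-ordered test, with validity checked before any ordering comparison. This is an illustration under assumptions, not the patch text: the typedef and helpers are simplified stand-ins modelled on transam.h, and should_update_oldest_catalog_xmin / update_oldest_catalog_xmin are hypothetical names. Locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/* True if a is logically older than b; normal xids compare modulo 2^32. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/* "Set it if its validity has changed or it is advancing." */
bool
should_update_oldest_catalog_xmin(TransactionId current, TransactionId pending)
{
	/* Validity changed: first value being set, or the limit being cleared. */
	if (xid_is_valid(current) != xid_is_valid(pending))
		return true;

	/* Both invalid: nothing to do. */
	if (!xid_is_valid(current))
		return false;

	/* Both valid: only ever move forwards. */
	return xid_precedes(current, pending);
}

void
update_oldest_catalog_xmin(TransactionId *oldest, TransactionId pending)
{
	if (should_update_oldest_catalog_xmin(*oldest, pending))
		*oldest = pending;

	/* Post-condition: the stored value never lags a valid pending value. */
	assert(!xid_is_valid(pending) || !xid_precedes(*oldest, pending));
}
```

Splitting the condition into three early returns is longer than the single expression, but it keeps the ordering comparison away from invalid xids entirely.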

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..a4ecfb7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -136,6 +136,17 @@ typedef struct VariableCacheData
* aborted */
/*
+      * This field is protected by ProcArrayLock except
+      * during recovery, when it's set unlocked.
+      *
+      * oldestCatalogXmin is the oldest xid it is
+      * guaranteed to be safe to create a historic
+      * snapshot for. See also
+      * procArray->replication_slot_catalog_xmin
+      */
+     TransactionId oldestCatalogXmin;

Maybe it'd be better to rephrase that to something like
"oldestCatalogXmin guarantees that no valid catalog tuples >= it
are removed. That property is used for logical decoding." or similar?

Fine by me.

I'll adjust this per discussion and per a comment Simon made
separately. Whether we use it right away or not it's worth having it
updated while it's still freshly in mind.

OK, updated catalog_xmin logging patch attached.

Important fix included: when faking up a RunningTransactions snapshot
in StartupXLOG for replay of shutdown checkpoints, copy the
checkpoint's oldestCatalogXmin so we apply it instead of clobbering
the replica's value. It's kind of roundabout to set this once when we
apply the checkpoint and again via ProcArrayApplyRecoveryInfo, but
it's necessary if we're using xl_running_xacts to carry
oldestCatalogXmin info.

Found another issue too. We log our intention to increase
oldestCatalogXmin in LogStandbySnapshot when we write
xl_running_xacts. We then release ProcArrayLock to re-acquire it
LW_EXCLUSIVE so we can advance oldestCatalogXmin in shmem. But a
checkpoint can run in that window and copy the old oldestCatalogXmin value
after we wrote xlog but before we updated shmem. On the standby, redo will
apply the new value then clobber it with the old one.

To fix this, take CheckpointLock in LogStandbySnapshot (if not called
during a checkpoint) so we can't have the xl_running_xacts with the
new oldestCatalogXmin land up in WAL before a checkpoint with an older
value. Also take oldestCatalogXmin's value after we've forced
LogStandbySnapshot in a checkpoint.
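
The ordering problem can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not PostgreSQL code: Master, snapshot_log, snapshot_apply, checkpoint and standby_replay are hypothetical names, WAL is reduced to the oldestCatalogXmin carried by each record, and redo is modelled as "each record overwrites the standby's value", which is how the description above says the standby behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_RECORDS 8

typedef struct
{
	TransactionId shmem_oldest;		/* master's oldestCatalogXmin in shmem */
	TransactionId wal[MAX_RECORDS];	/* value carried by each WAL record */
	size_t		nrecords;
} Master;

static void
wal_append(Master *m, TransactionId value)
{
	assert(m->nrecords < MAX_RECORDS);
	m->wal[m->nrecords++] = value;
}

/* First half of the standby snapshot: log the pending value. */
static void
snapshot_log(Master *m, TransactionId pending)
{
	wal_append(m, pending);
}

/* Second half: apply the logged value to shared memory. */
static void
snapshot_apply(Master *m, TransactionId pending)
{
	m->shmem_oldest = pending;
}

/* A checkpoint copies whatever is in shared memory into its record. */
static void
checkpoint(Master *m)
{
	wal_append(m, m->shmem_oldest);
}

/* Standby redo: every record overwrites the standby's value. */
static TransactionId
standby_replay(const Master *m, TransactionId start)
{
	TransactionId oldest = start;

	for (size_t i = 0; i < m->nrecords; i++)
		oldest = m->wal[i];
	return oldest;
}

/* Checkpoint sneaks in between logging and applying: standby regresses. */
TransactionId
racy_schedule(void)
{
	Master		m = {.shmem_oldest = 100};

	snapshot_log(&m, 200);
	checkpoint(&m);
	snapshot_apply(&m, 200);
	return standby_replay(&m, 100);
}

/* Log and apply are atomic with respect to the checkpoint. */
TransactionId
serialized_schedule(void)
{
	Master		m = {.shmem_oldest = 100};

	snapshot_log(&m, 200);
	snapshot_apply(&m, 200);
	checkpoint(&m);
	return standby_replay(&m, 100);
}
```

In the racy schedule the master ends at 200 while the standby ends at 100; keeping the checkpoint out of the log-then-apply window makes both agree.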

Extended tests a bit to cover redo on standbys.

Personally I'm not a huge fan of how integrating this with logging
standby snapshots has turned out. It seemed to make sense initially,
but I think the way it works out is more convoluted than necessary for
little benefit. I'll prep an updated version of the
xl_advance_catalog_xmin patch with the same fixes for side by side
comparison.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

log-catalog-xmin-advances-v4.patch (text/x-patch; charset=US-ASCII)
From 7f742f582e1f6f8f23c4e9d78cd0298180e5387c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH] Log catalog_xmin advances before removing catalog tuples

Before advancing the effective catalog_xmin we use to remove old catalog
tuple versions, make sure it is written to WAL. This allows standbys
to know the oldest xid they can safely create a historic snapshot for.
They can then refuse to start decoding from a slot or raise a recovery
conflict.

The catalog_xmin advance is logged in the next xl_running_xacts records, so
vacuum of catalogs may be held back up to 10 seconds when a replication slot
with catalog_xmin is holding down the global catalog_xmin.
---
 src/backend/access/heap/rewriteheap.c       |  3 +-
 src/backend/access/rmgrdesc/standbydesc.c   |  5 +-
 src/backend/access/rmgrdesc/xlogdesc.c      |  3 +-
 src/backend/access/transam/varsup.c         | 15 +++++
 src/backend/access/transam/xlog.c           | 26 +++++++-
 src/backend/postmaster/bgwriter.c           |  2 +-
 src/backend/replication/slot.c              |  2 +-
 src/backend/replication/walreceiver.c       |  2 +-
 src/backend/replication/walsender.c         |  8 +++
 src/backend/storage/ipc/procarray.c         | 61 ++++++++++++++++---
 src/backend/storage/ipc/standby.c           | 60 +++++++++++++++++--
 src/bin/pg_controldata/pg_controldata.c     |  2 +
 src/include/access/transam.h                |  5 ++
 src/include/catalog/pg_control.h            |  1 +
 src/include/storage/procarray.h             |  3 +-
 src/include/storage/standby.h               |  8 ++-
 src/include/storage/standbydefs.h           |  1 +
 src/test/recovery/t/006_logical_decoding.pl | 93 +++++++++++++++++++++++++++--
 18 files changed, 269 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 278546a..4aaae59 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -21,10 +21,11 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
 	int			i;
 
-	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
+	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u oldestCatalogXmin %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
-					 xlrec->oldestRunningXid);
+					 xlrec->oldestRunningXid,
+					 xlrec->oldestCatalogXmin);
 	if (xlrec->xcnt > 0)
 	{
 		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..a66cfc6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -47,7 +47,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
-						 "oldest running xid %u; %s",
+						 "oldest running xid %u; oldest catalog xmin %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -63,6 +63,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
+						 checkpoint->oldestCatalogXmin,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..ffabf1c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,6 +414,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or from LogCurrentRunningXacts()
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..cec68b2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6632,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -6908,6 +6913,8 @@ StartupXLOG(void)
 				running.subxid_overflow = false;
 				running.nextXid = checkPoint.nextXid;
 				running.oldestRunningXid = oldestActiveXID;
+				running.pendingOldestCatalogXmin
+					= checkPoint.oldestCatalogXmin;
 				latestCompletedXid = checkPoint.nextXid;
 				TransactionIdRetreat(latestCompletedXid);
 				Assert(TransactionIdIsNormal(latestCompletedXid));
@@ -8786,7 +8793,16 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+		LogStandbySnapshot(true);
+
+	/*
+	 * We must copy oldestCatalogXmin after the standby snapshot so we get any
+	 * updated value and don't clobber the value written in xl_running_xacts
+	 * with an older one in the checkpoint.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
 
 	START_CRIT_SECTION();
 
@@ -9670,6 +9686,8 @@ xlog_redo(XLogReaderState *record)
 			running.subxid_overflow = false;
 			running.nextXid = checkPoint.nextXid;
 			running.oldestRunningXid = oldestActiveXID;
+			running.pendingOldestCatalogXmin =
+				checkPoint.oldestCatalogXmin;
 			latestCompletedXid = checkPoint.nextXid;
 			TransactionIdRetreat(latestCompletedXid);
 			Assert(TransactionIdIsNormal(latestCompletedXid));
@@ -9729,8 +9747,10 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..258c955 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -330,7 +330,7 @@ BackgroundWriterMain(void)
 			if (now >= timeout &&
 				last_snapshot_lsn < GetLastImportantRecPtr())
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				last_snapshot_lsn = LogStandbySnapshot(false);
 				last_snapshot_ts = now;
 			}
 		}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 6c5ec7a..605990f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -947,7 +947,7 @@ ReplicationSlotReserveWal(void)
 			slot->data.restart_lsn = GetXLogInsertRecPtr();
 
 			/* make sure we have enough information to start */
-			flushptr = LogStandbySnapshot();
+			flushptr = LogStandbySnapshot(false);
 
 			/* and make sure it's fsynced to disk */
 			XLogFlush(flushptr);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index df93265..277f196 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1234,7 +1234,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..7c46e24 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1778,6 +1778,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..898a5ca 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,11 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -679,6 +683,15 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
 
 	/*
+	 * Update our knowledge of the oldest xid we can safely create historic
+	 * snapshots for.
+	 *
+	 * If we allow logical decoding on standbys in future we must raise
+	 * recovery conflicts with catalog_xmin advances here.
+	 */
+	SetOldestCatalogXmin(running->pendingOldestCatalogXmin);
+
+	/*
 	 * Remove stale locks, if any.
 	 *
 	 * Locks are always assigned to the toplevel xid so we don't need to care
@@ -1306,6 +1319,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1700,7 +1716,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1727,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2060,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
-	 * of a circular dependency where slots only increase their limits when
-	 * running xacts increases oldestRunningXid and running xacts only
+	 * of a circular dependency where slots only increase their xmin limits
+	 * when running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can safely report the catalog_xmin limit for replication slots here
+	 * because it's only used to advance oldestCatalogXmin. Slots' catalog_xmin
+	 * advance does not depend on it so there's no circularity.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2055,6 +2078,8 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->nextXid = ShmemVariableCache->nextXid;
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
+	CurrentRunningXacts->pendingOldestCatalogXmin =
+		procArray->replication_slot_catalog_xmin;
 
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
@@ -2171,6 +2196,13 @@ GetOldestSafeDecodingTransactionId(void)
 	 * If there's already a slot pegging the xmin horizon, we can start with
 	 * that value, it's guaranteed to be safe since it's computed by this
 	 * routine initially and has been enforced since.
+	 *
+	 * We don't use ShmemVariableCache->oldestCatalogXmin here because another
+	 * backend may have already logged its intention to advance it to a higher
+	 * value (still <= replication_slot_catalog_xmin) and just be waiting on
+	 * ProcArrayLock to actually apply the change. On a standby
+	 * replication_slot_catalog_xmin is what the walreceiver will be sending in
+	 * hot_standby_feedback, not oldestCatalogXmin.
 	 */
 	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
 		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
@@ -2965,18 +2997,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmins, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..7f73180 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -45,6 +45,7 @@ static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlis
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
+static void UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin);
 
 
 /*
@@ -822,6 +823,7 @@ standby_redo(XLogReaderState *record)
 		running.latestCompletedXid = xlrec->latestCompletedXid;
 		running.oldestRunningXid = xlrec->oldestRunningXid;
 		running.xids = xlrec->xids;
+		running.pendingOldestCatalogXmin = xlrec->oldestCatalogXmin;
 
 		ProcArrayApplyRecoveryInfo(&running);
 	}
@@ -906,7 +908,7 @@ standby_redo(XLogReaderState *record)
  * Returns the RecPtr of the last inserted record.
  */
 XLogRecPtr
-LogStandbySnapshot(void)
+LogStandbySnapshot(bool in_checkpoint)
 {
 	XLogRecPtr	recptr;
 	RunningTransactions running;
@@ -924,6 +926,17 @@ LogStandbySnapshot(void)
 	pfree(locks);
 
 	/*
+	 * We must lock out concurrent checkpoints so that a checkpoint doesn't
+	 * copy oldestCatalogXmin after we've written a pending new value to xlog
+	 * but before we've updated it in shmem. Otherwise a standby will get the
+	 * old value after replaying the checkpoint.
+	 */
+	if (in_checkpoint)
+		Assert(LWLockHeldByMe(CheckpointLock));
+	else
+		LWLockAcquire(CheckpointLock, LW_SHARED);
+
+	/*
 	 * Log details of all in-progress transactions. This should be the last
 	 * record we write, because standby will open up when it sees this.
 	 */
@@ -953,12 +966,29 @@ LogStandbySnapshot(void)
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
+	/*
+	 * Now that we've recorded our intention to allow cleanup of catalog tuples
+	 * no longer needed by our replication slots we can make the new threshold
+	 * effective for vacuum etc.
+	 */
+	UpdateOldestCatalogXmin(running->pendingOldestCatalogXmin);
+
+	if (!in_checkpoint)
+		LWLockRelease(CheckpointLock);
+
 	return recptr;
 }
 
 /*
  * Record an enhanced snapshot of running transactions into WAL.
  *
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here. We intend to advance
+ * ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
+ * of the new value, which will permit removal of previously-protected dead
+ * catalog tuples. The standby needs to know about that before any WAL
+ * records containing such tuple removals could possibly arrive.
+ *
  * The definitions of RunningTransactionsData and xl_xact_running_xacts are
  * similar. We keep them separate because xl_xact_running_xacts is a
  * contiguous chunk of memory and never exists fully until it is assembled in
@@ -977,6 +1007,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
+	xlrec.oldestCatalogXmin = CurrRunningXacts->pendingOldestCatalogXmin;
 
 	/* Header */
 	XLogBeginInsert();
@@ -992,20 +1023,22 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 
 	if (CurrRunningXacts->subxid_overflow)
 		elog(trace_recovery(DEBUG2),
-			 "snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+			 "snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u oldestCatalogXmin %u)",
 			 CurrRunningXacts->xcnt,
 			 (uint32) (recptr >> 32), (uint32) recptr,
 			 CurrRunningXacts->oldestRunningXid,
 			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+			 CurrRunningXacts->nextXid,
+			 CurrRunningXacts->pendingOldestCatalogXmin);
 	else
 		elog(trace_recovery(DEBUG2),
-			 "snapshot of %u+%u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+			 "snapshot of %u+%u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u oldestCatalogXmin %u)",
 			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
 			 (uint32) (recptr >> 32), (uint32) recptr,
 			 CurrRunningXacts->oldestRunningXid,
 			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+			 CurrRunningXacts->nextXid,
+			 CurrRunningXacts->pendingOldestCatalogXmin);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
@@ -1022,6 +1055,23 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 }
 
 /*
+ * Advance the oldestCatalogXmin used for removal of dead catalog tuples to the
+ * lowest catalog_xmin threshold of any current replication slots.
+ *
+ * Should only be called during a checkpoint or after writing an xlog record to
+ * record the advance.
+ */
+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	if ((TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin))
+		|| TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin))
+		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * Wholesale logging of AccessExclusiveLocks. Other lock types need not be
  * logged, as described in backend/storage/lmgr/README.
  */
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..0123fc8 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+	TransactionId oldestCatalogXmin;	/* oldestCatalogXmin guarantees that no
+										 * valid catalog tuples >= it are
+										 * removed. That property is used for
+										 * logical decoding. */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -179,6 +183,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..05ace64 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,7 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..4728fa5 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -65,6 +65,10 @@ extern void StandbyReleaseOldLocks(int nxids, TransactionId *xids);
  * is written to WAL as a separate record immediately after each
  * checkpoint. That means that wherever we start a standby from we will
  * almost immediately see the data we need to begin executing queries.
+ *
+ * Information about the oldest catalog_xmin needed by any replication slot is
+ * also included here, so we can use it to update the catalog tuple removal
+ * limit and convey the new limit to standbys.
  */
 
 typedef struct RunningTransactionsData
@@ -75,6 +79,8 @@ typedef struct RunningTransactionsData
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	/* so we can update ShmemVariableCache->oldestCatalogXmin: */
+	TransactionId pendingOldestCatalogXmin;
 
 	TransactionId *xids;		/* array of (sub)xids still running */
 } RunningTransactionsData;
@@ -84,7 +90,7 @@ typedef RunningTransactionsData *RunningTransactions;
 extern void LogAccessExclusiveLock(Oid dbOid, Oid relOid);
 extern void LogAccessExclusiveLockPrepare(void);
 
-extern XLogRecPtr LogStandbySnapshot(void);
+extern XLogRecPtr LogStandbySnapshot(bool in_checkpoint);
 extern void LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 						bool relcacheInitFileInval);
 
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index f8444c7..6153675 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -52,6 +52,7 @@ typedef struct xl_running_xacts
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	TransactionId oldestCatalogXmin;	/* oldest safe historic snapshot */
 
 	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..a1c92c8 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,24 +7,79 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 44;
 
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
-$node_master->append_conf(
-		'postgresql.conf', qq(
+$node_master->append_conf('postgresql.conf', qq(
 wal_level = logical
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+log_min_messages = debug1
 ));
 $node_master->start;
-my $backup_name = 'master_backup';
 
+# Set up some changes before we make base backups
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
 
 $node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
 
 $node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]);
 
+# Launch two streaming replicas, one with and one without
+# physical replication slots. We'll use these for tests
+# involving interaction of logical and physical standby.
+#
+# Both backups are created with pg_basebackup.
+#
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+$node_master->safe_psql('postgres', q[SELECT pg_create_physical_replication_slot('slot_replica');]);
+my $node_slot_replica = get_new_node('slot_replica');
+$node_slot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_slot_replica->append_conf('recovery.conf', "primary_slot_name = 'slot_replica'");
+
+my $node_noslot_replica = get_new_node('noslot_replica');
+$node_noslot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+
+$node_slot_replica->start;
+$node_noslot_replica->start;
+
+sub restartpoint_standbys
+{
+	# Force restartpoints to update control files on replicas
+	$node_slot_replica->safe_psql('postgres', 'CHECKPOINT');
+	$node_noslot_replica->safe_psql('postgres', 'CHECKPOINT');
+}
+
+sub wait_standbys
+{
+	my $lsn = $node_master->lsn('insert');
+	$node_master->wait_for_catchup($node_noslot_replica, 'replay', $lsn);
+	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
+}
+
+# pg_basebackup doesn't copy replication slots
+is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
+	'logical slot test_slot on master not copied by pg_basebackup');
+
+# Make sure oldestCatalogXmin lands in the control file on master
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
+
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero after start on " . $node->name);
+}
+
 # Basic decoding works
 my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
 is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
@@ -64,6 +119,9 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+# Create a second DB we'll use for testing dropping and accessing slots across
+# databases. This matters since logical slots are globally visible objects that
+# can only actually be used on one DB for most purposes.
 $node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
 
 is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
@@ -96,9 +154,32 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
+}
+
+# Dropping the slot must clear catalog_xmin
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+# First checkpoint forces xl_running_xacts with the new oldestCatalogXmin
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+# Then we need a second checkpoint to write the control file with the new value
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+}
 
-# done with the node
-$node_master->stop;
+foreach my $node (@nodes)
+{
+	$node->stop;
+}
-- 
2.5.5
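
The advance rule in UpdateOldestCatalogXmin() above can be read in isolation. What follows is a minimal standalone sketch, not code from the patch: the names xid_is_valid, xid_is_normal, xid_precedes, sketch_apply_pending and sketch_self_check are hypothetical, xid_precedes is only modelled on TransactionIdPrecedes(), and the ProcArrayLock handling is left out. It assumes the rule is exactly the condition shown in the hunk: take the pending value when validity changes or when the current value precedes it, otherwise keep the current value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/* Wraparound-aware "a is older than b", modelled on TransactionIdPrecedes(). */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	/* Special (non-normal) xids compare as plain unsigned integers. */
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	/* Normal xids compare modulo 2^32. */
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical stand-in for the update of
 * ShmemVariableCache->oldestCatalogXmin: returns the value that would be
 * stored. No locking; the real function holds ProcArrayLock exclusively.
 */
static TransactionId
sketch_apply_pending(TransactionId current, TransactionId pending)
{
	if (xid_is_valid(current) != xid_is_valid(pending) ||
		xid_precedes(current, pending))
		return pending;
	return current;
}

/* Usage: the three interesting transitions. */
static void
sketch_self_check(void)
{
	/* First slot with a catalog_xmin appears: limit becomes valid. */
	assert(sketch_apply_pending(InvalidTransactionId, 100) == 100);
	/* Last such slot is dropped: limit is cleared. */
	assert(sketch_apply_pending(100, InvalidTransactionId) == InvalidTransactionId);
	/* A stale pending value never moves a valid limit backwards. */
	assert(sketch_apply_pending(200, 100) == 200);
}
```

Under these assumptions the limit only moves forward while both values are valid, which seems to be the property the 006_logical_decoding.pl checks depend on when they expect a nonzero oldestCatalogXmin while the slot exists and zero after it is dropped.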

#93Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#92)
1 attachment(s)
Re: Logical decoding on standby

On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:

> OK, updated catalog_xmin logging patch attached.

Ahem, that should be v5.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

log-catalog-xmin-advances-v5.patch (text/x-patch; charset=US-ASCII)
From 7f742f582e1f6f8f23c4e9d78cd0298180e5387c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH] Log catalog_xmin advances before removing catalog tuples

Before advancing the effective catalog_xmin we use to remove old catalog
tuple versions, make sure it is written to WAL. This allows standbys
to know the oldest xid they can safely create a historic snapshot for.
They can then refuse to start decoding from a slot or raise a recovery
conflict.

The catalog_xmin advance is logged in the next xl_running_xacts record, so
vacuum of catalogs may be held back up to 10 seconds when a replication slot
with catalog_xmin is holding down the global catalog_xmin.
---
 src/backend/access/heap/rewriteheap.c       |  3 +-
 src/backend/access/rmgrdesc/standbydesc.c   |  5 +-
 src/backend/access/rmgrdesc/xlogdesc.c      |  3 +-
 src/backend/access/transam/varsup.c         | 15 +++++
 src/backend/access/transam/xlog.c           | 26 +++++++-
 src/backend/postmaster/bgwriter.c           |  2 +-
 src/backend/replication/slot.c              |  2 +-
 src/backend/replication/walreceiver.c       |  2 +-
 src/backend/replication/walsender.c         |  8 +++
 src/backend/storage/ipc/procarray.c         | 61 ++++++++++++++++---
 src/backend/storage/ipc/standby.c           | 60 +++++++++++++++++--
 src/bin/pg_controldata/pg_controldata.c     |  2 +
 src/include/access/transam.h                |  5 ++
 src/include/catalog/pg_control.h            |  1 +
 src/include/storage/procarray.h             |  3 +-
 src/include/storage/standby.h               |  8 ++-
 src/include/storage/standbydefs.h           |  1 +
 src/test/recovery/t/006_logical_decoding.pl | 93 +++++++++++++++++++++++++++--
 18 files changed, 269 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 278546a..4aaae59 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -21,10 +21,11 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
 	int			i;
 
-	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
+	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u oldestCatalogXmin %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
-					 xlrec->oldestRunningXid);
+					 xlrec->oldestRunningXid,
+					 xlrec->oldestCatalogXmin);
 	if (xlrec->xcnt > 0)
 	{
 		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..a66cfc6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -47,7 +47,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
-						 "oldest running xid %u; %s",
+						 "oldest running xid %u; oldest catalog xmin %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -63,6 +63,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
+						 checkpoint->oldestCatalogXmin,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..ffabf1c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,6 +414,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or from LogCurrentRunningXacts()
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..cec68b2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6632,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -6908,6 +6913,8 @@ StartupXLOG(void)
 				running.subxid_overflow = false;
 				running.nextXid = checkPoint.nextXid;
 				running.oldestRunningXid = oldestActiveXID;
+				running.pendingOldestCatalogXmin
+					= checkPoint.oldestCatalogXmin;
 				latestCompletedXid = checkPoint.nextXid;
 				TransactionIdRetreat(latestCompletedXid);
 				Assert(TransactionIdIsNormal(latestCompletedXid));
@@ -8786,7 +8793,16 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+		LogStandbySnapshot(true);
+
+	/*
+	 * We must copy oldestCatalogXmin after the standby snapshot so we get any
+	 * updated value and don't clobber the value written in xl_running_xacts
+	 * with an older one in the checkpoint.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
 
 	START_CRIT_SECTION();
 
@@ -9670,6 +9686,8 @@ xlog_redo(XLogReaderState *record)
 			running.subxid_overflow = false;
 			running.nextXid = checkPoint.nextXid;
 			running.oldestRunningXid = oldestActiveXID;
+			running.pendingOldestCatalogXmin =
+				checkPoint.oldestCatalogXmin;
 			latestCompletedXid = checkPoint.nextXid;
 			TransactionIdRetreat(latestCompletedXid);
 			Assert(TransactionIdIsNormal(latestCompletedXid));
@@ -9729,8 +9747,10 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..258c955 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -330,7 +330,7 @@ BackgroundWriterMain(void)
 			if (now >= timeout &&
 				last_snapshot_lsn < GetLastImportantRecPtr())
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				last_snapshot_lsn = LogStandbySnapshot(false);
 				last_snapshot_ts = now;
 			}
 		}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 6c5ec7a..605990f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -947,7 +947,7 @@ ReplicationSlotReserveWal(void)
 			slot->data.restart_lsn = GetXLogInsertRecPtr();
 
 			/* make sure we have enough information to start */
-			flushptr = LogStandbySnapshot();
+			flushptr = LogStandbySnapshot(false);
 
 			/* and make sure it's fsynced to disk */
 			XLogFlush(flushptr);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index df93265..277f196 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1234,7 +1234,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..7c46e24 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1778,6 +1778,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..898a5ca 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,11 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -679,6 +683,15 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
 
 	/*
+	 * Update our knowledge of the oldest xid we can safely create historic
+	 * snapshots for.
+	 *
+	 * If we allow logical decoding on standbys in future we must raise
+	 * recovery conflicts with catalog_xmin advances here.
+	 */
+	SetOldestCatalogXmin(running->pendingOldestCatalogXmin);
+
+	/*
 	 * Remove stale locks, if any.
 	 *
 	 * Locks are always assigned to the toplevel xid so we don't need to care
@@ -1306,6 +1319,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1700,7 +1716,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1727,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2060,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
-	 * of a circular dependency where slots only increase their limits when
-	 * running xacts increases oldestRunningXid and running xacts only
+	 * of a circular dependency where slots only increase their xmin limits
+	 * when running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can safely report the catalog_xmin limit for replication slots here
+	 * because it's only used to advance oldestCatalogXmin. Slots' catalog_xmin
+	 * advance does not depend on it so there's no circularity.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2055,6 +2078,8 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->nextXid = ShmemVariableCache->nextXid;
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
+	CurrentRunningXacts->pendingOldestCatalogXmin =
+		procArray->replication_slot_catalog_xmin;
 
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
@@ -2171,6 +2196,13 @@ GetOldestSafeDecodingTransactionId(void)
 	 * If there's already a slot pegging the xmin horizon, we can start with
 	 * that value, it's guaranteed to be safe since it's computed by this
 	 * routine initially and has been enforced since.
+	 *
+	 * We don't use ShmemVariableCache->oldestCatalogXmin here because another
+	 * backend may have already logged its intention to advance it to a higher
+	 * value (still <= replication_slot_catalog_xmin) and just be waiting on
+	 * ProcArrayLock to actually apply the change. On a standby
+	 * replication_slot_catalog_xmin is what the walreceiver will be sending in
+	 * hot_standby_feedback, not oldestCatalogXmin.
 	 */
 	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
 		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
@@ -2965,18 +2997,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmins, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..7f73180 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -45,6 +45,7 @@ static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlis
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
+static void UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin);
 
 
 /*
@@ -822,6 +823,7 @@ standby_redo(XLogReaderState *record)
 		running.latestCompletedXid = xlrec->latestCompletedXid;
 		running.oldestRunningXid = xlrec->oldestRunningXid;
 		running.xids = xlrec->xids;
+		running.pendingOldestCatalogXmin = xlrec->oldestCatalogXmin;
 
 		ProcArrayApplyRecoveryInfo(&running);
 	}
@@ -906,7 +908,7 @@ standby_redo(XLogReaderState *record)
  * Returns the RecPtr of the last inserted record.
  */
 XLogRecPtr
-LogStandbySnapshot(void)
+LogStandbySnapshot(bool in_checkpoint)
 {
 	XLogRecPtr	recptr;
 	RunningTransactions running;
@@ -924,6 +926,17 @@ LogStandbySnapshot(void)
 	pfree(locks);
 
 	/*
+	 * We must lock out concurrent checkpoints so that a checkpoint doesn't
+	 * copy oldestCatalogXmin after we've written a pending new value to xlog
+	 * but before we've updated it in shmem. Otherwise a standby will get the
+	 * old value after replaying the checkpoint.
+	 */
+	if (in_checkpoint)
+		Assert(LWLockHeldByMe(CheckpointLock));
+	else
+		LWLockAcquire(CheckpointLock, LW_SHARED);
+
+	/*
 	 * Log details of all in-progress transactions. This should be the last
 	 * record we write, because standby will open up when it sees this.
 	 */
@@ -953,12 +966,29 @@ LogStandbySnapshot(void)
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
+	/*
+	 * Now that we've recorded our intention to allow cleanup of catalog tuples
+	 * no longer needed by our replication slots we can make the new threshold
+	 * effective for vacuum etc.
+	 */
+	UpdateOldestCatalogXmin(running->pendingOldestCatalogXmin);
+
+	if (!in_checkpoint)
+		LWLockRelease(CheckpointLock);
+
 	return recptr;
 }
 
 /*
  * Record an enhanced snapshot of running transactions into WAL.
  *
+ * We also record the value of procArray->replication_slot_catalog_xmin
+ * obtained from GetRunningTransactionData here. We intend to advance
+ * ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
+ * of the new value, which will permit removal of previously-protected dead
+ * catalog tuples. The standby needs to know about that before any WAL
+ * records containing such tuple removals could possibly arrive.
+ *
  * The definitions of RunningTransactionsData and xl_xact_running_xacts are
  * similar. We keep them separate because xl_xact_running_xacts is a
  * contiguous chunk of memory and never exists fully until it is assembled in
@@ -977,6 +1007,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
+	xlrec.oldestCatalogXmin = CurrRunningXacts->pendingOldestCatalogXmin;
 
 	/* Header */
 	XLogBeginInsert();
@@ -992,20 +1023,22 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 
 	if (CurrRunningXacts->subxid_overflow)
 		elog(trace_recovery(DEBUG2),
-			 "snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+			 "snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u oldestCatalogXmin %u)",
 			 CurrRunningXacts->xcnt,
 			 (uint32) (recptr >> 32), (uint32) recptr,
 			 CurrRunningXacts->oldestRunningXid,
 			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+			 CurrRunningXacts->nextXid,
+			 CurrRunningXacts->pendingOldestCatalogXmin);
 	else
 		elog(trace_recovery(DEBUG2),
-			 "snapshot of %u+%u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+			 "snapshot of %u+%u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u oldestCatalogXmin %u)",
 			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
 			 (uint32) (recptr >> 32), (uint32) recptr,
 			 CurrRunningXacts->oldestRunningXid,
 			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+			 CurrRunningXacts->nextXid,
+			 CurrRunningXacts->pendingOldestCatalogXmin);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
@@ -1022,6 +1055,23 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 }
 
 /*
+ * Advance the oldestCatalogXmin used for removal of dead catalog tuples to the
+ * lowest catalog_xmin threshold of any current replication slots.
+ *
+ * Should only be called during a checkpoint or after writing an xlog record to
+ * record the advance.
+ */
+static void
+UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	if ((TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin))
+		|| TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin))
+		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
+
+/*
  * Wholesale logging of AccessExclusiveLocks. Other lock types need not be
  * logged, as described in backend/storage/lmgr/README.
  */
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..0123fc8 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+	TransactionId oldestCatalogXmin;	/* oldestCatalogXmin guarantees that no
+										 * valid catalog tuples >= it are
+										 * removed. That property is used for
+										 * logical decoding. */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -179,6 +183,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..05ace64 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,7 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..4728fa5 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -65,6 +65,10 @@ extern void StandbyReleaseOldLocks(int nxids, TransactionId *xids);
  * is written to WAL as a separate record immediately after each
  * checkpoint. That means that wherever we start a standby from we will
  * almost immediately see the data we need to begin executing queries.
+ *
+ * Information about the oldest catalog_xmin needed by any replication slot is
+ * also included here, so we can use it to update the catalog tuple removal
+ * limit and convey the new limit to standbys.
  */
 
 typedef struct RunningTransactionsData
@@ -75,6 +79,8 @@ typedef struct RunningTransactionsData
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	/* so we can update ShmemVariableCache->oldestCatalogXmin: */
+	TransactionId pendingOldestCatalogXmin;
 
 	TransactionId *xids;		/* array of (sub)xids still running */
 } RunningTransactionsData;
@@ -84,7 +90,7 @@ typedef RunningTransactionsData *RunningTransactions;
 extern void LogAccessExclusiveLock(Oid dbOid, Oid relOid);
 extern void LogAccessExclusiveLockPrepare(void);
 
-extern XLogRecPtr LogStandbySnapshot(void);
+extern XLogRecPtr LogStandbySnapshot(bool in_checkpoint);
 extern void LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
 						bool relcacheInitFileInval);
 
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index f8444c7..6153675 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -52,6 +52,7 @@ typedef struct xl_running_xacts
 	TransactionId nextXid;		/* copy of ShmemVariableCache->nextXid */
 	TransactionId oldestRunningXid;		/* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
+	TransactionId oldestCatalogXmin;	/* oldest safe historic snapshot */
 
 	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..a1c92c8 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,24 +7,79 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 44;
 
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
-$node_master->append_conf(
-		'postgresql.conf', qq(
+$node_master->append_conf('postgresql.conf', qq(
 wal_level = logical
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+log_min_messages = debug1
 ));
 $node_master->start;
-my $backup_name = 'master_backup';
 
+# Set up some changes before we make base backups
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
 
 $node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
 
 $node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]);
 
+# Launch two streaming replicas, one with and one without
+# physical replication slots. We'll use these for tests
+# involving interaction of logical and physical standby.
+#
+# Both backups are created with pg_basebackup.
+#
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+$node_master->safe_psql('postgres', q[SELECT pg_create_physical_replication_slot('slot_replica');]);
+my $node_slot_replica = get_new_node('slot_replica');
+$node_slot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_slot_replica->append_conf('recovery.conf', "primary_slot_name = 'slot_replica'");
+
+my $node_noslot_replica = get_new_node('noslot_replica');
+$node_noslot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+
+$node_slot_replica->start;
+$node_noslot_replica->start;
+
+sub restartpoint_standbys
+{
+	# Force restartpoints to update control files on replicas
+	$node_slot_replica->safe_psql('postgres', 'CHECKPOINT');
+	$node_noslot_replica->safe_psql('postgres', 'CHECKPOINT');
+}
+
+sub wait_standbys
+{
+	my $lsn = $node_master->lsn('insert');
+	$node_master->wait_for_catchup($node_noslot_replica, 'replay', $lsn);
+	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
+}
+
+# pg_basebackup doesn't copy replication slots
+is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
+	'logical slot test_slot on master not copied by pg_basebackup');
+
+# Make sure oldestCatalogXmin lands in the control file on master
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
+
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero after start on " . $node->name);
+}
+
 # Basic decoding works
 my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
 is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
@@ -64,6 +119,9 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+# Create a second DB we'll use for testing dropping and accessing slots across
+# databases. This matters since logical slots are globally visible objects that
+# can only actually be used on one DB for most purposes.
 $node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
 
 is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
@@ -96,9 +154,32 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
+}
+
+# Dropping the slot must clear catalog_xmin
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+# First checkpoint forces xl_running_xacts with the new oldestCatalogXmin
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+# Then we need a second checkpoint to write the control file with the new value
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+}
 
-# done with the node
-$node_master->stop;
+foreach my $node (@nodes)
+{
+	$node->stop;
+}
-- 
2.5.5

#94Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#93)
1 attachment(s)
Re: Logical decoding on standby

On 3 April 2017 at 15:27, Craig Ringer <craig@2ndquadrant.com> wrote:

On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:

OK, updated catalog_xmin logging patch attached.

Ahem, that should be v5.

... and here's v6, which returns to the separate
xl_xact_catalog_xmin_advance approach.

pgindented.

This is what I favour proceeding with.

Now updating/amending recovery conflict patch.
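
For anyone following along, the ordering v6 relies on can be sketched as a
toy model. This is not code from the patch: ToyCatalogXminState and the
toy_* functions are hypothetical names, the "WAL record" is just a field
in a struct, and the xid comparison is a simplified wraparound-aware
check. The only point is that the new catalog_xmin is recorded before it
becomes the limit used for tuple removal, and that nothing is written
unless the limit actually changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; none of this is the patch's actual code. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct ToyCatalogXminState
{
	TransactionId needed;		/* lowest catalog_xmin any slot still needs */
	TransactionId retained;		/* limit honoured when removing catalog tuples */
	TransactionId logged;		/* payload of the last toy "WAL record" */
	int			wal_records;	/* number of toy records written so far */
} ToyCatalogXminState;

static bool
toy_xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

/* Wraparound-aware "a is older than b"; only meaningful for valid xids. */
static bool
toy_xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Log first, then apply. Returns true if a record was written and the
 * retained limit changed, false if there was nothing to do.
 */
static bool
toy_update_oldest_catalog_xmin(ToyCatalogXminState *s)
{
	TransactionId pending = s->needed;
	bool		changed;

	/* Change if validity flips, or if the valid limit moves forward. */
	changed = toy_xid_is_valid(s->retained) != toy_xid_is_valid(pending) ||
		(toy_xid_is_valid(pending) && toy_xid_precedes(s->retained, pending));
	if (!changed)
		return false;

	/* Step 1: record the intended new limit so standbys learn of it. */
	s->logged = pending;
	s->wal_records++;

	/* Step 2: only now make the new limit effective for tuple removal. */
	assert(s->logged == pending);
	s->retained = pending;
	return true;
}
```

In this model a second call with an unchanged needed value writes nothing,
a needed value that moves backwards is ignored, and dropping the last slot
(needed becomes invalid) resets the retained limit to invalid as well.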

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

log-catalog-xmin-advances-v6.patch (text/x-patch; charset=US-ASCII)
From 353e987584e22d268b8ab1c10c46d7e8c74ef552 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH] Log catalog_xmin advances before removing catalog tuples

Before advancing the effective catalog_xmin we use to remove old catalog
tuple versions, make sure it is written to WAL. This allows standbys
to know the oldest xid they can safely create a historic snapshot for.
They can then refuse to start decoding from a slot or raise a recovery
conflict.

The catalog_xmin advance is logged in a new xl_xact_catalog_xmin_advance record,
emitted before vacuum or periodically by the bgwriter. WAL is only written if
the lowest catalog_xmin needed by any replication slot has advanced.
---
 src/backend/access/heap/rewriteheap.c       |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c      |   9 ++
 src/backend/access/rmgrdesc/xlogdesc.c      |   3 +-
 src/backend/access/transam/varsup.c         |  15 ++++
 src/backend/access/transam/xact.c           |  36 ++++++++
 src/backend/access/transam/xlog.c           |  20 ++++-
 src/backend/commands/vacuum.c               |   9 ++
 src/backend/postmaster/bgwriter.c           |  10 +++
 src/backend/replication/logical/decode.c    |  12 +++
 src/backend/replication/walreceiver.c       |   2 +-
 src/backend/replication/walsender.c         |   8 ++
 src/backend/storage/ipc/procarray.c         | 132 ++++++++++++++++++++++++++--
 src/bin/pg_controldata/pg_controldata.c     |   2 +
 src/include/access/transam.h                |   5 ++
 src/include/access/xact.h                   |  12 ++-
 src/include/catalog/pg_control.h            |   1 +
 src/include/storage/procarray.h             |   5 +-
 src/test/recovery/t/006_logical_decoding.pl |  90 +++++++++++++++++--
 18 files changed, 353 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..a66cfc6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -47,7 +47,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
-						 "oldest running xid %u; %s",
+						 "oldest running xid %u; oldest catalog xmin %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -63,6 +63,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
+						 checkpoint->oldestCatalogXmin,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..ffabf1c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,6 +414,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or from LogCurrentRunningXacts()
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..63453d7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5652,6 +5652,42 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 *
+		 * Existing sessions are not notified and must check the safe xmin.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	xl_xact_catalog_xmin_advance xlrec;
+
+	xlrec.new_catalog_xmin = new_catalog_xmin;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+	return XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..b53d7ef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,12 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6635,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8726,6 +8734,10 @@ CreateCheckPoint(int flags)
 							 &checkPoint.oldestMulti,
 							 &checkPoint.oldestMultiDB);
 
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
 	/*
 	 * Having constructed the checkpoint record, ensure all shmem disk buffers
 	 * and commit-log buffers are flushed to disk.
@@ -9632,6 +9644,8 @@ xlog_redo(XLogReaderState *record)
 		 */
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
 		 * record, the backup was canceled and the end-of-backup record will
@@ -9729,8 +9743,10 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 9fbb0eb..ae41dc3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
 	MultiXactId safeMxactLimit;
 
 	/*
+	 * When logical decoding is enabled, we must write any advance of
+	 * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
+	 * This ensures that any standbys doing logical decoding can cancel
+	 * decoding sessions and invalidate slots if we remove tuples they
+	 * still need.
+	 */
+	UpdateOldestCatalogXmin();
+
+	/*
 	 * We can always ignore processes running lazy vacuum.  This is because we
 	 * use these values only for deciding which tuples we must keep in the
 	 * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..2bed256 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -295,6 +296,15 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Eagerly advance the catalog_xmin used by vacuum if we're not a
+		 * standby. This ensures that standbys waiting for catalog_xmin
+		 * confirmation receive it promptly, even if we haven't had a recent
+		 * vacuum run.
+		 */
+		if (!RecoveryInProgress())
+			UpdateOldestCatalogXmin();
+
+		/*
 		 * Log a new xl_running_xacts every now and then so replication can
 		 * get into a consistent state faster (think of suboverflowed
 		 * snapshots) and clean up resources (locks, KnownXids*) more
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b5084b9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,18 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index df93265..277f196 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1234,7 +1234,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index cfc3fba..7c46e24 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1778,6 +1778,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..6deb169 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,12 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestGlobalXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -1306,6 +1311,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1444,6 +1452,87 @@ GetOldestXmin(Relation rel, int flags)
 }
 
 /*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal and write a WAL
+ * record recording the change.
+ *
+ * This allows standbys to know the oldest xid for which it is safe to create
+ * a historic snapshot for logical decoding. VACUUM or other cleanup may have
+ * removed catalog tuple versions needed to correctly decode transactions older
+ * than this threshold. Standbys can use this information to cancel conflicting
+ * decoding sessions and invalidate slots that need discarded information.
+ *
+ * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
+ * this, since they don't identify the relation as a catalog or not.  Nor can a
+ * standby look up the relcache to get the Relation for the affected
+ * relfilenode to check if it is a catalog. The standby would also have no way
+ * to know the oldest safe position at startup if it wasn't in the control
+ * file.)
+ */
+void
+UpdateOldestCatalogXmin(void)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	Assert(XLogInsertAllowed());
+
+	/*
+	 * It's most likely that replication_slot_catalog_xmin and
+	 * oldestCatalogXmin will be the same and no action is required, so do a
+	 * pre-check before doing expensive WAL writing and exclusive locking.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	vacuum_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+	slots_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	LWLockRelease(ProcArrayLock);
+
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+	{
+		/*
+		 * We must prevent a concurrent checkpoint, otherwise the catalog xmin
+		 * advance xlog record with the new value might be written before the
+		 * checkpoint but the checkpoint may still see the old
+		 * oldestCatalogXmin value.
+		 */
+		LWLockAcquire(CheckpointLock, LW_SHARED);
+
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		/*
+		 * A concurrent updater could've changed the oldestCatalogXmin so we
+		 * need to re-check under ProcArrayLock before updating. The LWLock
+		 * provides a barrier.
+		 *
+		 * We must not re-read replication_slot_catalog_xmin even if it has
+		 * advanced, since we xlog'd the older value. If it advanced since, a
+		 * later run will xlog the new value and advance.
+		 */
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		vacuum_catalog_xmin = *((volatile TransactionId *) &ShmemVariableCache->oldestCatalogXmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			ShmemVariableCache->oldestCatalogXmin = slots_catalog_xmin;
+		LWLockRelease(ProcArrayLock);
+
+		LWLockRelease(CheckpointLock);
+	}
+
+}
+
+/*
  * GetMaxSnapshotXidCount -- get max size for snapshot XID array
  *
  * We have to export this for use by snapmgr.c.
@@ -1700,7 +1789,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1800,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2133,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
-	 * of a circular dependency where slots only increase their limits when
-	 * running xacts increases oldestRunningXid and running xacts only
+	 * of a circular dependency where slots only increase their xmin limits
+	 * when running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can safely report the catalog_xmin limit for replication slots here
+	 * because it's only used to advance oldestCatalogXmin. Slots'
+	 * catalog_xmin advance does not depend on it so there's no circularity.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2171,6 +2267,13 @@ GetOldestSafeDecodingTransactionId(void)
 	 * If there's already a slot pegging the xmin horizon, we can start with
 	 * that value, it's guaranteed to be safe since it's computed by this
 	 * routine initially and has been enforced since.
+	 *
+	 * We don't use ShmemVariableCache->oldestCatalogXmin here because another
+	 * backend may have already logged its intention to advance it to a higher
+	 * value (still <= replication_slot_catalog_xmin) and just be waiting on
+	 * ProcArrayLock to actually apply the change. On a standby
+	 * replication_slot_catalog_xmin is what the walreceiver will be sending
+	 * in hot_standby_feedback, not oldestCatalogXmin.
 	 */
 	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
 		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
@@ -2965,18 +3068,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmins, we return both the effective
+ * catalog_xmin being used for tuple removal (retained_catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..c2cb0a1 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+	TransactionId oldestCatalogXmin;	/* oldestCatalogXmin guarantees that
+										 * no valid catalog tuples >= it
+										 * are removed. That property is used
+										 * for logical decoding. */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -179,6 +183,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..6d18d18 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -137,7 +137,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -187,6 +187,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+}	xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -391,6 +398,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   int xactflags, TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..69a82d7 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(void);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..80b976b 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,24 +7,79 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 44;
 
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
-$node_master->append_conf(
-		'postgresql.conf', qq(
+$node_master->append_conf('postgresql.conf', qq(
 wal_level = logical
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+log_min_messages = debug1
 ));
 $node_master->start;
-my $backup_name = 'master_backup';
 
+# Set up some changes before we make base backups
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
 
 $node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
 
 $node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]);
 
+# Launch two streaming replicas, one with and one without
+# physical replication slots. We'll use these for tests
+# involving interaction of logical and physical standby.
+#
+# Both backups are created with pg_basebackup.
+#
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+$node_master->safe_psql('postgres', q[SELECT pg_create_physical_replication_slot('slot_replica');]);
+my $node_slot_replica = get_new_node('slot_replica');
+$node_slot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_slot_replica->append_conf('recovery.conf', "primary_slot_name = 'slot_replica'");
+
+my $node_noslot_replica = get_new_node('noslot_replica');
+$node_noslot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+
+$node_slot_replica->start;
+$node_noslot_replica->start;
+
+sub restartpoint_standbys
+{
+	# Force restartpoints to update control files on replicas
+	$node_slot_replica->safe_psql('postgres', 'CHECKPOINT');
+	$node_noslot_replica->safe_psql('postgres', 'CHECKPOINT');
+}
+
+sub wait_standbys
+{
+	my $lsn = $node_master->lsn('insert');
+	$node_master->wait_for_catchup($node_noslot_replica, 'replay', $lsn);
+	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
+}
+
+# pg_basebackup doesn't copy replication slots
+is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
+	'logical slot test_slot on master not copied by pg_basebackup');
+
+# Make sure oldestCatalogXmin lands in the control file on master
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
+
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero after start on " . $node->name);
+}
+
 # Basic decoding works
 my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
 is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
@@ -64,6 +119,9 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+# Create a second DB we'll use for testing dropping and accessing slots across
+# databases. This matters since logical slots are globally visible objects that
+# can only actually be used on one DB for most purposes.
 $node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
 
 is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
@@ -96,9 +154,29 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
+}
+
+# Dropping the slot must clear catalog_xmin
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+}
 
-# done with the node
-$node_master->stop;
+foreach my $node (@nodes)
+{
+	$node->stop;
+}
-- 
2.5.5

#95Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#94)
Re: Logical decoding on standby

Hi all

Here's the final set of three patches on top of what's already committed.

The first is catalog_xmin logging, which is unchanged from the prior post.
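
As a rough illustration of what the catalog_xmin logging patch decides, here is a
standalone C sketch. It is not the patch code: the comparison helper only
imitates the wraparound-aware style of TransactionIdPrecedes, and SketchState,
sketch_update_oldest_catalog_xmin and the wal_records_written counter are
hypothetical stand-ins for shared memory and WAL insertion. Locking is left
out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId        ((TransactionId) 0)
#define FirstNormalTransactionId    ((TransactionId) 3)

static bool
xid_is_normal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

/* Wraparound-aware "a is older than b" (modulo-2^32 comparison). */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    if (!xid_is_normal(a) || !xid_is_normal(b))
        return a < b;
    return (int32_t) (a - b) < 0;
}

/* Hypothetical stand-in for the shared state involved (assumption). */
typedef struct SketchState
{
    TransactionId oldest_catalog_xmin;  /* threshold honoured by vacuum */
    TransactionId slots_catalog_xmin;   /* oldest catalog_xmin any slot needs */
    int           wal_records_written;  /* stand-in for WAL insertion */
    TransactionId last_logged;          /* value carried by the last record */
} SketchState;

/* True if the threshold must advance, or be newly set or unset. */
static bool
catalog_xmin_needs_update(TransactionId vacuum_xmin, TransactionId slots_xmin)
{
    return xid_precedes(vacuum_xmin, slots_xmin) ||
        ((vacuum_xmin != InvalidTransactionId) !=
         (slots_xmin != InvalidTransactionId));
}

/* Log first, publish second. Returns true if a record was "written". */
static bool
sketch_update_oldest_catalog_xmin(SketchState *st)
{
    TransactionId target = st->slots_catalog_xmin;

    if (!catalog_xmin_needs_update(st->oldest_catalog_xmin, target))
        return false;

    /* The record goes out before the in-memory threshold moves. */
    st->wal_records_written++;
    st->last_logged = target;

    st->oldest_catalog_xmin = target;
    return true;
}
```

Calling sketch_update_oldest_catalog_xmin() repeatedly with an unchanged
slots_catalog_xmin writes nothing after the first call, which is the "WAL is
only written if something changed" property the patch is after.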

The 2nd is support for conflict with recovery, with changes that
should address Andres's concerns there.

The 3rd actually enables decoding on standby. Unlike the prior
version, no attempt is made to check the walsender configuration for
slot use, etc. The ugly code to try to mitigate races is also removed.
Instead, if wal_level is logical the catalog_xmin sent by
hot_standby_feedback is now the same as the xmin if there's no local
slot holding it down. So we're always sending a catalog_xmin in
feedback and we should always expect to have a valid local
oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
race in slot creation no worse than the existing race between
hot_standby_feedback establishment and the first queries run on a
downstream, albeit with more annoying consequences. Apps can still
ensure a slot created on standby is guaranteed safe and conflict-free
by having a slot on the master first.
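
One possible reading of the feedback rule described above, sketched as
standalone C. The function feedback_catalog_xmin and its parameters are
hypothetical, and the third patch may well structure this differently; the
sketch only captures "send the xmin as catalog_xmin when wal_level is logical
and no local slot is holding it down".

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId        ((TransactionId) 0)
#define FirstNormalTransactionId    ((TransactionId) 3)

/* Wraparound-aware "a is older than b" (modulo-2^32 comparison). */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
        return a < b;
    return (int32_t) (a - b) < 0;
}

/*
 * Pick the catalog_xmin a standby would report in hot_standby_feedback.
 *
 * xmin is the standby's own oldest xmin; slot_catalog_xmin is the oldest
 * catalog_xmin needed by local slots, or InvalidTransactionId if there are
 * none. Both names are assumptions made for this sketch.
 */
static TransactionId
feedback_catalog_xmin(bool wal_level_logical,
                      TransactionId xmin,
                      TransactionId slot_catalog_xmin)
{
    /* No usable xmin to fall back on: report whatever the slots need. */
    if (xmin == InvalidTransactionId)
        return slot_catalog_xmin;

    /* A local slot that is older than xmin holds the value down. */
    if (slot_catalog_xmin != InvalidTransactionId &&
        xid_precedes(slot_catalog_xmin, xmin))
        return slot_catalog_xmin;

    /* With wal_level = logical, always report something valid. */
    if (wal_level_logical)
        return xmin;

    return slot_catalog_xmin;
}
```

Under this reading the upstream always has a valid catalog_xmin once feedback
starts, so a slot created later on the standby races only with the first
feedback message rather than with every subsequent vacuum.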

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#96Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#95)
3 attachment(s)
Re: Logical decoding on standby

On 4 April 2017 at 22:32, Craig Ringer <craig@2ndquadrant.com> wrote:

> Hi all
>
> Here's the final set of three patches on top of what's already committed.
>
> The first is catalog_xmin logging, which is unchanged from the prior post.
>
> The 2nd is support for conflict with recovery, with changes that
> should address Andres's concerns there.
>
> The 3rd actually enables decoding on standby. Unlike the prior
> version, no attempt is made to check the walsender configuration for
> slot use, etc. The ugly code to try to mitigate races is also removed.
> Instead, if wal_level is logical the catalog_xmin sent by
> hot_standby_feedback is now the same as the xmin if there's no local
> slot holding it down. So we're always sending a catalog_xmin in
> feedback and we should always expect to have a valid local
> oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
> race in slot creation no worse than the existing race between
> hot_standby_feedback establishment and the first queries run on a
> downstream, albeit with more annoying consequences. Apps can still
> ensure a slot created on standby is guaranteed safe and conflict-free
> by having a slot on the master first.
>
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

Dammit. Attached.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

01-log-catalog-xmin-advances-v6.patch (text/x-patch; charset=US-ASCII)
From 9b8b1236eb32819430062031ff76750ed8bc1661 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH 1/3] Log catalog_xmin advances before removing catalog tuples

Before advancing the effective catalog_xmin we use to remove old catalog
tuple versions, make sure it is written to WAL. This allows standbys
to know the oldest xid they can safely create a historic snapshot for.
They can then refuse to start decoding from a slot or raise a recovery
conflict.

The catalog_xmin advance is logged in a new xl_xact_catalog_xmin_advance
record (XLOG_XACT_CATALOG_XMIN_ADV), emitted before vacuum, at the start of
a checkpoint, or periodically by the bgwriter. WAL is only written if the
lowest catalog_xmin needed by any replication slot has advanced, or has
become newly set or unset.
---
 src/backend/access/heap/rewriteheap.c       |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c      |   9 ++
 src/backend/access/rmgrdesc/xlogdesc.c      |   3 +-
 src/backend/access/transam/varsup.c         |  15 ++++
 src/backend/access/transam/xact.c           |  36 ++++++++
 src/backend/access/transam/xlog.c           |  23 ++++-
 src/backend/postmaster/bgwriter.c           |   9 ++
 src/backend/replication/logical/decode.c    |  12 +++
 src/backend/replication/walreceiver.c       |   2 +-
 src/backend/replication/walsender.c         |   8 ++
 src/backend/storage/ipc/procarray.c         | 134 ++++++++++++++++++++++++++--
 src/bin/pg_controldata/pg_controldata.c     |   2 +
 src/include/access/transam.h                |   5 ++
 src/include/access/xact.h                   |  12 ++-
 src/include/catalog/pg_control.h            |   1 +
 src/include/storage/procarray.h             |   5 +-
 src/test/recovery/t/006_logical_decoding.pl |  90 +++++++++++++++++--
 17 files changed, 348 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..a66cfc6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -47,7 +47,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
-						 "oldest running xid %u; %s",
+						 "oldest running xid %u; oldest catalog xmin %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -63,6 +63,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
+						 checkpoint->oldestCatalogXmin,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..ffabf1c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,6 +414,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or during bootstrap. Elsewhere
+ * use UpdateOldestCatalogXmin(), which writes the update to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..63453d7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5652,6 +5652,42 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 *
+		 * Existing sessions are not notified and must check the safe xmin.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	xl_xact_catalog_xmin_advance xlrec;
+
+	xlrec.new_catalog_xmin = new_catalog_xmin;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+	return XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..8d713e9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,9 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6635,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8537,6 +8545,9 @@ CreateCheckPoint(int flags)
 	 */
 	InitXLogInsert();
 
+	/* Checkpoints are a handy time to update the effective catalog_xmin */
+	UpdateOldestCatalogXmin();
+
 	/*
 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
 	 * (This is just pro forma, since in the present system structure there is
@@ -8726,6 +8737,10 @@ CreateCheckPoint(int flags)
 							 &checkPoint.oldestMulti,
 							 &checkPoint.oldestMultiDB);
 
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
 	/*
 	 * Having constructed the checkpoint record, ensure all shmem disk buffers
 	 * and commit-log buffers are flushed to disk.
@@ -9632,6 +9647,8 @@ xlog_redo(XLogReaderState *record)
 		 */
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
 		 * record, the backup was canceled and the end-of-backup record will
@@ -9729,8 +9746,10 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..3bb5200 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -333,6 +334,14 @@ BackgroundWriterMain(void)
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
 			}
+
+			/*
+			 * We can also advance the threshold used for catalog tuple
+			 * cleanup, rate-limited so we don't write it too often. The delay
+			 * slightly increases catalog bloat but reduces the volume of
+			 * catalog_xmin advance records written.
+			 */
+			UpdateOldestCatalogXmin();
 		}
 
 		/*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b5084b9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,18 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index df93265..277f196 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1234,7 +1234,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dbb10c7..9f3a86b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1778,6 +1778,14 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..9e98af8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,12 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -1306,6 +1311,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1444,6 +1452,89 @@ GetOldestXmin(Relation rel, int flags)
 }
 
 /*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal and write a WAL
+ * record recording the change.
+ *
+ * This allows standbys to know the oldest xid for which it is safe to create
+ * a historic snapshot for logical decoding. VACUUM or other cleanup may have
+ * removed catalog tuple versions needed to correctly decode transactions older
+ * than this threshold. Standbys can use this information to cancel conflicting
+ * decoding sessions and invalidate slots that need discarded information.
+ *
+ * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
+ * this, since they don't identify the relation as a catalog or not.  Nor can a
+ * standby look up the relcache to get the Relation for the affected
+ * relfilenode to check if it is a catalog. The standby would also have no way
+ * to know the oldest safe position at startup if it wasn't in the control
+ * file.)
+ */
+void
+UpdateOldestCatalogXmin(void)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	Assert(XLogInsertAllowed());
+
+	/*
+	 * It's most likely that replication_slot_catalog_xmin and
+	 * oldestCatalogXmin will be the same and no action is required, so do a
+	 * pre-check before doing expensive WAL writing and exclusive locking.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	vacuum_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+	slots_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	LWLockRelease(ProcArrayLock);
+
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+	{
+		/*
+		 * We must prevent a concurrent checkpoint, otherwise the catalog xmin
+		 * advance xlog record with the new value might be written before the
+		 * checkpoint but the checkpoint may still see the old
+		 * oldestCatalogXmin value.
+		 */
+		if (!LWLockConditionalAcquire(CheckpointLock, LW_SHARED))
+			/* Couldn't get checkpointer lock; will retry later */
+			return;
+
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		/*
+		 * A concurrent updater could've changed the oldestCatalogXmin so we
+		 * need to re-check under ProcArrayLock before updating. The LWLock
+		 * provides a barrier.
+		 *
+		 * We must not re-read replication_slot_catalog_xmin even if it has
+		 * advanced, since we xlog'd the older value. If it advanced since, a
+		 * later run will xlog the new value and advance.
+		 */
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		vacuum_catalog_xmin = *((volatile TransactionId *) &ShmemVariableCache->oldestCatalogXmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			ShmemVariableCache->oldestCatalogXmin = slots_catalog_xmin;
+		LWLockRelease(ProcArrayLock);
+
+		LWLockRelease(CheckpointLock);
+	}
+
+}
+
+/*
  * GetMaxSnapshotXidCount -- get max size for snapshot XID array
  *
  * We have to export this for use by snapmgr.c.
@@ -1700,7 +1791,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1802,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2135,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
-	 * of a circular dependency where slots only increase their limits when
-	 * running xacts increases oldestRunningXid and running xacts only
+	 * of a circular dependency where slots only increase their xmin limits
+	 * when running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can safely report the catalog_xmin limit for replication slots here
+	 * because it's only used to advance oldestCatalogXmin. Slots'
+	 * catalog_xmin advance does not depend on it so there's no circularity.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2171,6 +2269,13 @@ GetOldestSafeDecodingTransactionId(void)
 	 * If there's already a slot pegging the xmin horizon, we can start with
 	 * that value, it's guaranteed to be safe since it's computed by this
 	 * routine initially and has been enforced since.
+	 *
+	 * We don't use ShmemVariableCache->oldestCatalogXmin here because another
+	 * backend may have already logged its intention to advance it to a higher
+	 * value (still <= replication_slot_catalog_xmin) and just be waiting on
+	 * ProcArrayLock to actually apply the change. On a standby
+	 * replication_slot_catalog_xmin is what the walreceiver will be sending
+	 * in hot_standby_feedback, not oldestCatalogXmin.
 	 */
 	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
 		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
@@ -2965,18 +3070,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmins, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be older than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..c2cb0a1 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+	TransactionId oldestCatalogXmin;	/* oldestCatalogXmin guarantees that
+										 * no valid catalog tuples >= it
+										 * are removed. That property is used
+										 * for logical decoding. */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -179,6 +183,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..6d18d18 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -137,7 +137,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -187,6 +187,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+}	xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -391,6 +398,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   int xactflags, TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..69a82d7 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(void);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..80b976b 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,24 +7,79 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 44;
 
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
-$node_master->append_conf(
-		'postgresql.conf', qq(
+$node_master->append_conf('postgresql.conf', qq(
 wal_level = logical
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+log_min_messages = debug1
 ));
 $node_master->start;
-my $backup_name = 'master_backup';
 
+# Set up some changes before we make base backups
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
 
 $node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
 
 $node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]);
 
+# Launch two streaming replicas, one with and one without
+# physical replication slots. We'll use these for tests
+# involving interaction of logical and physical standby.
+#
+# Both backups are created with pg_basebackup.
+#
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+$node_master->safe_psql('postgres', q[SELECT pg_create_physical_replication_slot('slot_replica');]);
+my $node_slot_replica = get_new_node('slot_replica');
+$node_slot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_slot_replica->append_conf('recovery.conf', "primary_slot_name = 'slot_replica'");
+
+my $node_noslot_replica = get_new_node('noslot_replica');
+$node_noslot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+
+$node_slot_replica->start;
+$node_noslot_replica->start;
+
+sub restartpoint_standbys
+{
+	# Force restartpoints to update control files on replicas
+	$node_slot_replica->safe_psql('postgres', 'CHECKPOINT');
+	$node_noslot_replica->safe_psql('postgres', 'CHECKPOINT');
+}
+
+sub wait_standbys
+{
+	my $lsn = $node_master->lsn('insert');
+	$node_master->wait_for_catchup($node_noslot_replica, 'replay', $lsn);
+	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
+}
+
+# pg_basebackup doesn't copy replication slots
+is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
+	'logical slot test_slot on master not copied by pg_basebackup');
+
+# Make sure oldestCatalogXmin lands in the control file on master
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
+
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero after start on " . $node->name);
+}
+
 # Basic decoding works
 my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
 is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
@@ -64,6 +119,9 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+# Create a second DB we'll use for testing dropping and accessing slots across
+# databases. This matters since logical slots are globally visible objects that
+# can only actually be used on one DB for most purposes.
 $node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
 
 is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
@@ -96,9 +154,29 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
+}
+
+# Dropping the slot must clear catalog_xmin
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+}
 
-# done with the node
-$node_master->stop;
+foreach my $node (@nodes)
+{
+	$node->stop;
+}
-- 
2.5.5
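
For anyone wanting to poke at the update rule in CatalogXminNeedsUpdate() from the patch above without building a server, here is a minimal standalone sketch. It is not part of the patch. xid_is_valid(), xid_is_normal() and xid_precedes() are simplified stand-ins I wrote for TransactionIdIsValid(), TransactionIdIsNormal() and TransactionIdPrecedes(); catalog_xmin_needs_update() and catalog_xmin_selfcheck() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

/* Stand-in for TransactionIdIsValid(): 0 means "unset". */
static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

/* Stand-in for TransactionIdIsNormal(): xids below 3 are special. */
static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/*
 * Stand-in for TransactionIdPrecedes(): special xids compare as plain
 * integers, normal xids compare modulo 2^32 so wraparound is handled.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
	if (!xid_is_normal(id1) || !xid_is_normal(id2))
		return id1 < id2;
	return (int32_t) (id1 - id2) < 0;
}

/*
 * Mirrors the predicate in the patch: update when the slots' catalog_xmin
 * has advanced past the value used for cleanup, or when exactly one of the
 * two values is set.
 */
static bool
catalog_xmin_needs_update(TransactionId vacuum_catalog_xmin,
						  TransactionId slots_catalog_xmin)
{
	return xid_precedes(vacuum_catalog_xmin, slots_catalog_xmin)
		|| (xid_is_valid(vacuum_catalog_xmin) != xid_is_valid(slots_catalog_xmin));
}

/* A few cases worth checking by hand. */
void
catalog_xmin_selfcheck(void)
{
	assert(catalog_xmin_needs_update(100, 200));	/* slots advanced */
	assert(!catalog_xmin_needs_update(200, 100));	/* never move backwards */
	assert(catalog_xmin_needs_update(100, InvalidTransactionId));	/* last slot dropped */
}
```

The "validity differs" half of the test is what lets oldestCatalogXmin go back to zero after the last logical slot is dropped, which is the case the final pg_controldata check in the TAP test exercises.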

Attachment: 02-decoding-recovery-conflicts-v6.patch (text/x-patch; charset=US-ASCII)
From c62acf7789a4d1a3db666fa1b2f67ba69af1f237 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 3 Apr 2017 17:31:19 +0800
Subject: [PATCH 2/3] Support conflict with standby on logical walsender

---
 src/backend/access/heap/heapam.c          |   2 +-
 src/backend/access/transam/xact.c         |   6 +-
 src/backend/access/transam/xlog.c         |   3 +-
 src/backend/replication/logical/logical.c | 139 ++++++++++++++++++++++++++++++
 src/backend/replication/slot.c            |   4 +-
 src/backend/replication/walsender.c       |  14 +--
 src/backend/storage/ipc/procarray.c       |  51 +++++++++++
 src/backend/storage/ipc/procsignal.c      |   3 +
 src/backend/storage/ipc/standby.c         |   7 +-
 src/backend/tcop/postgres.c               |  43 ++++++---
 src/include/storage/procarray.h           |   2 +
 src/include/storage/procsignal.h          |   1 +
 src/include/storage/standby.h             |   3 +
 13 files changed, 245 insertions(+), 33 deletions(-)
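
As a reading aid for the logical.c changes in this patch, the two checks it adds can be sketched in isolation. This is a rough standalone model, not the patch code: SketchSlot is a hypothetical cut-down stand-in for ReplicationSlot, the xid_* helpers are simplified stand-ins for the TransactionId macros and functions, and slot_conflicts(), slot_usable() and slot_sketch_selfcheck() are names made up for illustration. Locking, signalling and max_standby_streaming_delay handling are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

/* Hypothetical, heavily reduced model of a replication slot. */
typedef struct SketchSlot
{
	bool		in_use;
	bool		is_logical;
	int			active_pid;		/* 0 when no backend holds the slot */
	TransactionId effective_catalog_xmin;
} SketchSlot;

static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/* Wraparound-aware "id1 is older than id2". */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
	if (!xid_is_normal(id1) || !xid_is_normal(id2))
		return id1 < id2;
	return (int32_t) (id1 - id2) < 0;
}

/* Wraparound-aware "id1 is newer than id2". */
static bool
xid_follows(TransactionId id1, TransactionId id2)
{
	if (!xid_is_normal(id1) || !xid_is_normal(id2))
		return id1 > id2;
	return (int32_t) (id1 - id2) > 0;
}

/*
 * Replay-side test: does an active logical slot conflict with a newly
 * replayed catalog_xmin? An invalid new value means the upstream retains
 * no catalog tuples for decoding at all, so every active slot conflicts.
 */
static bool
slot_conflicts(const SketchSlot *slot, TransactionId new_catalog_xmin)
{
	if (!(slot->in_use && slot->is_logical))
		return false;
	if (slot->active_pid == 0)
		return false;			/* checked when a session next starts */
	if (!xid_is_valid(slot->effective_catalog_xmin))
		return false;
	return !xid_is_valid(new_catalog_xmin)
		|| xid_precedes(slot->effective_catalog_xmin, new_catalog_xmin);
}

/*
 * Session-start test: a slot is usable only if the retained catalog_xmin
 * is set and has not moved past what the slot needs.
 */
static bool
slot_usable(TransactionId oldest_catalog_xmin, TransactionId slot_catalog_xmin)
{
	return xid_is_valid(oldest_catalog_xmin)
		&& !xid_follows(oldest_catalog_xmin, slot_catalog_xmin);
}

void
slot_sketch_selfcheck(void)
{
	SketchSlot	s = {true, true, 4242, 500};

	assert(slot_conflicts(&s, 600));	/* catalogs removed past our xmin */
	assert(!slot_conflicts(&s, 500));	/* exactly at our xmin is still OK */
	assert(slot_usable(500, 500));
	assert(!slot_usable(600, 500));
}
```

The two tests are meant to be complementary: sessions that are already running are caught at replay time by the first, and anything idle is caught by the second when it next tries to start decoding.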

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..93bf143 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7273,7 +7273,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 63453d7..48ca884 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5662,14 +5662,10 @@ xact_redo(XLogReaderState *record)
 		 * notice when we signal them with a recovery conflict. There's no
 		 * effect on the catalogs themselves yet, so it's safe for backends
 		 * with older catalog_xmins to still exist.
-		 *
-		 * We don't have to take ProcArrayLock since only the startup process
-		 * is allowed to change oldestCatalogXmin when we're in recovery.
-		 *
-		 * Existing sessions are not notified and must check the safe xmin.
 		 */
 		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
 
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
 	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d713e9..a98601a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8546,7 +8546,8 @@ CreateCheckPoint(int flags)
 	InitXLogInsert();
 
 	/* Checkpoints are a handy time to update the effective catalog_xmin */
-	UpdateOldestCatalogXmin();
+	if (XLogInsertAllowed())
+		UpdateOldestCatalogXmin();
 
 	/*
 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..282e330 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,8 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -218,6 +224,7 @@ CreateInitDecodingContext(char *plugin,
 	ReplicationSlot *slot;
 	LogicalDecodingContext *ctx;
 	MemoryContext old_context;
+	bool force_standby_snapshot;
 
 	/* shorter lines... */
 	slot = MyReplicationSlot;
@@ -276,8 +283,21 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	/*
+	 * If this is the first slot created on the master we won't have a
+	 * persistent record of the oldest safe xid for historic snapshots yet.
+	 * Force one to be recorded so that when we go to replay from this slot we
+	 * know it's safe.
+	 */
+	force_standby_snapshot =
+		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
+
 	LWLockRelease(ProcArrayLock);
 
+	/* Update ShmemVariableCache->oldestCatalogXmin */
+	if (force_standby_snapshot)
+		UpdateOldestCatalogXmin();
+
 	/*
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
@@ -376,6 +396,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	EnsureActiveLogicalSlotValid();
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId,
 								 read_page, prepare_write, do_write);
@@ -963,3 +985,120 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	TransactionId shmem_catalog_xmin;
+
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * A logical slot can become unusable if we're doing logical decoding on a
+	 * standby or using a slot created before we were promoted from standby
+	 * to master. If the master advanced its global catalog_xmin past the
+	 * threshold we need it could've removed catalog tuple versions that
+	 * we'll require to start decoding at our restart_lsn.
+	 */
+
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
+	if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+		TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("replication slot '%s' requires catalogs removed by master",
+						NameStr(MyReplicationSlot->data.name)),
+				 errdetail("need catalog_xmin %u, have oldestCatalogXmin %u",
+						   MyReplicationSlot->data.catalog_xmin, shmem_catalog_xmin)));
+}
+
+/*
+ * Scan to see if any clients are using replication slots that are below a
+ * newly-applied catalog_xmin threshold and signal them to terminate with a
+ * recovery conflict.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and signal its owning backend
+	 * to exit. We'll be called repeatedly by the recovery code until there
+	 * are no more conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but conflicts are the
+		 * problem of the leaf replica with the logical slot.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of active logical
+		 * slots. Anything else gets checked when a new decoding session tries
+		 * to start.
+		 */
+		 while (slot->in_use && slot->active_pid != 0 &&
+				TransactionIdIsValid(slot->effective_catalog_xmin) &&
+				(!TransactionIdIsValid(new_catalog_xmin) ||
+				 TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)))
+		{
+			/*
+			 * We'll be sleeping, so release the control lock. New conflicting
+			 * backends cannot appear and if old ones go away that's what we
+			 * want, so release and re-acquire is OK here.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("Pid %u requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				/*
+				 * Signal the proc. If the slot is already released or even if
+				 * pid is re-used we don't care, backends are required to
+				 * tolerate spurious recovery signals.
+				 */
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/* Don't flood the system with signals */
+				pg_usleep(10000);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 6c5ec7a..57a3994 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -48,6 +48,7 @@
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/standby.h"
 #include "utils/builtins.h"
 
 /*
@@ -931,7 +932,8 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9f3a86b..ef63b63 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -212,7 +212,6 @@ static struct
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -2831,17 +2830,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2875,7 +2863,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 9e98af8..05e3058 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2762,6 +2762,57 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with newly set
+ * catalog_xmin from the master. We're about to start replaying WAL
+ * that will make its historic snapshot potentially unsafe by removing
+ * system tuples it might need.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+	BackendId	backend_id = InvalidBackendId;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan by SendProcSignal.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+			backend_id = procvxid.backendId;
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Kill the pid if it's still here. If not, that's what we
+	 * wanted so ignore any errors.
+	 */
+	if (backend_id != InvalidBackendId)
+		(void) SendProcSignal(session_pid,
+			PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, backend_id);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..f6106ca 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,11 +153,13 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
-static bool
+bool
 WaitExceedsMaxStandbyDelay(void)
 {
 	TimestampTz ltime;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a2282058..530dcbe 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2276,6 +2276,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2698,8 +2701,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2781,6 +2788,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2795,12 +2803,18 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
+		 *
+		 * catalog_xmin is non-retryable because once we advance the
+		 * catalog_xmin threshold we might replay WAL that removes
+		 * needed catalog tuples. The slot can't (re)start decoding
+		 * because its catalog_xmin cannot be satisfied.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2855,11 +2869,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 69a82d7..231297d 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -112,6 +112,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..b17ba6f 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,10 +34,13 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
 extern void StandbyLockTimeoutHandler(void);
+extern bool WaitExceedsMaxStandbyDelay(void);
 
 /*
  * Standby Rmgr (RM_STANDBY_ID)
-- 
2.5.5
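
For anyone reading the conflict check in ResolveRecoveryConflictWithLogicalDecoding above without the tree at hand, here is a minimal standalone sketch of the condition its inner while loop tests. It is not code from the patch: Xid, INVALID_XID, FIRST_NORMAL_XID, xid_precedes and slot_conflicts are hypothetical stand-ins, and the wraparound-aware comparison is assumed to behave like the usual modulo-2^32 xid comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and friends; not PostgreSQL's headers. */
typedef uint32_t Xid;
#define INVALID_XID      ((Xid) 0)
#define FIRST_NORMAL_XID ((Xid) 3)

/*
 * Wraparound-aware "a is older than b". Special (non-normal) xids compare
 * as plain unsigned integers; normal xids compare modulo 2^32.
 */
static bool
xid_precedes(Xid a, Xid b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/*
 * Assumed shape of the loop condition: an in-use slot with an active
 * backend conflicts if it has a valid catalog_xmin and the new threshold
 * is either invalid (no catalogs are being retained any more) or has
 * moved past what the slot needs.
 */
static bool
slot_conflicts(bool in_use, int active_pid,
			   Xid slot_catalog_xmin, Xid new_catalog_xmin)
{
	if (!in_use || active_pid == 0 || slot_catalog_xmin == INVALID_XID)
		return false;

	return new_catalog_xmin == INVALID_XID ||
		xid_precedes(slot_catalog_xmin, new_catalog_xmin);
}
```

A slot whose catalog_xmin equals the new threshold does not conflict, since only catalog tuples older than the threshold may have been removed. Inactive slots are not signalled here; they get checked when a decoding session next tries to start on them.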

Attachment: 03-decoding-on-standby-v6.patch (text/x-patch; charset=US-ASCII)
From 854e0f586d6c4f28d02469717122a2997b4410fd Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 4 Apr 2017 11:50:30 +0800
Subject: [PATCH 3/3] Support decoding on standby

---
 src/backend/replication/logical/logical.c          |  61 ++-
 src/backend/replication/walreceiver.c              |  12 +
 src/test/recovery/t/006_logical_decoding.pl        |  70 ++-
 .../recovery/t/012_logical_decoding_on_replica.pl  | 497 +++++++++++++++++++++
 4 files changed, 600 insertions(+), 40 deletions(-)
 create mode 100644 src/test/recovery/t/012_logical_decoding_on_replica.pl

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 282e330..35d110f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -93,23 +93,40 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		/*----
+		 * We really want to enforce that:
+		 * - we're connected to the primary via a replication slot
+		 * - hot_standby_feedback is enabled
+		 * - the user cannot turn hot_standby_feedback off while we have
+		 *   logical slots on the standby (it's PGC_SIGHUP)
+		 * - hot_standby_feedback has actually taken effect on the master
+		 *
+		 * ... but because the walreceiver doesn't use normal GUCs and may or
+		 * may not actually be running we can't reliably enforce those
+		 * conditions yet. We also have no way of knowing when hot standby
+		 * feedback has reached the master and locked in a catalog_xmin.
+		 *
+		 * So on standbys, slot creation or decoding from a slot may fail with
+		 * a recovery conflict. But we keep track of the master's true
+		 * catalog_xmin in WAL, so we'll never attempt to decode unsafely.
+		 *
+		 * Make a best effort sanity check anyway.
+		 *---
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("logical decoding on standby requires hot_standby_feedback = on")));
+
+		LWLockAcquire(ProcArrayLock, LW_SHARED);
+		if (!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback has not yet taken effect")));
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
@@ -224,7 +241,6 @@ CreateInitDecodingContext(char *plugin,
 	ReplicationSlot *slot;
 	LogicalDecodingContext *ctx;
 	MemoryContext old_context;
-	bool force_standby_snapshot;
 
 	/* shorter lines... */
 	slot = MyReplicationSlot;
@@ -283,19 +299,16 @@ CreateInitDecodingContext(char *plugin,
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
+	LWLockRelease(ProcArrayLock);
+
 	/*
 	 * If this is the first slot created on the master we won't have a
 	 * persistent record of the oldest safe xid for historic snapshots yet.
 	 * Force one to be recorded so that when we go to replay from this slot we
 	 * know it's safe.
 	 */
-	force_standby_snapshot =
-		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
-
-	LWLockRelease(ProcArrayLock);
-
-	/* Update ShmemVariableCache->oldestCatalogXmin */
-	if (force_standby_snapshot)
+	if (!RecoveryInProgress() &&
+		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
 		UpdateOldestCatalogXmin();
 
 	/*
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 277f196..c0f6cec 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1239,6 +1239,18 @@ XLogWalRcvSendHSFeedback(bool immed)
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
 			xmin = slot_xmin;
+
+		/*
+		 * If there's no local catalog_xmin, report it as == xmin, so that
+		 * we lock in a catalog_xmin before we need to create any logical slots
+		 * on this standby. This won't add much catalog bloat until we create
+		 * local slots and catalog_xmin starts lagging behind xmin, but it will
+		 * cause the master to start logging
+		 * xl_xact_catalog_xmin_advance records we need for logical
+		 * decoding on standby.
+		 */
+		if (!TransactionIdIsValid(catalog_xmin) && XLogLogicalInfoActive())
+			catalog_xmin = xmin;
 	}
 	else
 	{
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 80b976b..88ddf00 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 44;
+use Test::More tests => 57;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -61,18 +61,22 @@ sub wait_standbys
 	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
 }
 
+sub sync_up
+{
+	$node_master->safe_psql('postgres', 'CHECKPOINT;');
+	wait_standbys();
+	restartpoint_standbys();
+	# allow time for hot_standby_feedback to be sent (wal_receiver_status_interval)
+	sleep(1.5);
+}
+
 # pg_basebackup doesn't copy replication slots
 is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
 	'logical slot test_slot on master not copied by pg_basebackup');
 
-# Make sure oldestCatalogXmin lands in the control file on master
-$node_master->safe_psql('postgres', 'VACUUM;');
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
 
 my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
-
-wait_standbys();
-restartpoint_standbys();
+sync_up();
 foreach my $node (@nodes)
 {
 	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
@@ -154,26 +158,60 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
-wait_standbys();
-restartpoint_standbys();
+sync_up();
 foreach my $node (@nodes)
 {
 	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
 		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
 }
 
-# Dropping the slot must clear catalog_xmin
+# Drop the logical slot on the master; make sure feedback from standbys continues to peg
+# catalog_xmin.
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
-is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
-$node_master->safe_psql('postgres', 'VACUUM;');
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-wait_standbys();
-restartpoint_standbys();
+is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped on master');
+# Do a dummy xact so that xmin advances, and we can check that
+# catalog_xmin advances along with it.
+my $xmin = $node_master->safe_psql('postgres', 'BEGIN; CREATE TABLE dummy_xact(blah integer); SELECT txid_current(); COMMIT;');
+
+# even though the logical slot on the upstream is dropped, master's
+# oldestCatalogXmin is held down by hot standby feedback from the replicas.
+# Since the replicas have no logical slots of their own, it should've advanced
+# to be the same as the physical slot xmin for the slot replica.
+sync_up();
+# There are no transactions on the replicas so their xmin and catalog_xmin
+# will both be nextXid.
+cmp_ok($node_master->slot('slot_replica')->{'xmin'}, "eq", $xmin + 1,
+	'xmin advanced to latest master xid on slot_replica on master');
+cmp_ok($node_master->slot('slot_replica')->{'catalog_xmin'}, "le", $xmin + 1,
+	'xmin == catalog_xmin on phys slot held down by standby catalog_xmin');
+# Control files will still contain the xid, since there won't have been another
+# checkpoint to advance the nextXid reported by feedback and write it to the
+# control file.
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:$xmin$/m,
+		"pg_controldata's oldestCatalogXmin advanced after drop, vacuum and checkpoint on " . $node->name);
+}
+
+# if we turn hot_standby_feedback off on the replica that uses a slot, the
+# master should no longer have anything holding down its catalog_xmin. Even
+# though hot_standby_feedback is still enabled on the non-slot replica, it
+# cannot set the master's catalog_xmin because it has no destination slot,
+# it can only set xmin in its procarray entry.
+$node_slot_replica->safe_psql('postgres', q[ALTER SYSTEM SET hot_standby_feedback = off;]);
+# simplest way to force new hot standby feedback to be sent
+$node_slot_replica->restart;
+sleep(1);
+# hot standby feedback should've cleared minimums
+is($node_master->slot('slot_replica')->{'xmin'}, '', 'phys slot xmin null with hs_feedback off');
+is($node_master->slot('slot_replica')->{'catalog_xmin'}, '', 'phys slot catalog_xmin null with hs_feedback off');
+sync_up();
+# Everyone should now see the cleared catalog_xmin
 foreach my $node (@nodes)
 {
 	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
-		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+		"pg_controldata's oldestCatalogXmin zero after turning off hs_feedback: " . $node->name);
 }
 
 foreach my $node (@nodes)
diff --git a/src/test/recovery/t/012_logical_decoding_on_replica.pl b/src/test/recovery/t/012_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..6ed0abc
--- /dev/null
+++ b/src/test/recovery/t/012_logical_decoding_on_replica.pl
@@ -0,0 +1,497 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 63;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+
+# If no slot on standby exists to hold down catalog_xmin it must follow xmin,
+# (which is nextXid when no xacts are running on the standby).
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+is($xmin, $catalog_xmin, "xmin and catalog_xmin equal");
+
+# We need catalog_xmin advance to take effect on the master and be replayed
+# on standby.
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+diag "creating slot standby_logical";
+my $start_time = [Time::HiRes::gettimeofday()];
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded')
+	or BAIL_OUT('cannot continue if slot creation fails, see logs');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+	or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	diag "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+diag "Testing catalog_xmin retention with hs_feedback on";
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# a new decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Data-only changes, no effect on catalogs. We should replay them fine
+# without a conflict, since they advance xmin but not catalog_xmin.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+$node_master->safe_psql('testdb', 'VACUUM FULL test_table');
+$node_master->safe_psql('testdb', 'VACUUM;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+diag "pumping";
+$handle->pump;
+diag "pumped";
+
+ok($node_replica->slot('standby_logical')->{'active_pid'}, 'pg_recvlogical still connected to slot');
+
+# If we change the catalogs, we'll get a conflict with recovery, but only if
+# there's an active xact when decoding.
+diag "dropping dummy_table";
+$node_master->safe_psql('testdb', 'DROP TABLE dummy_table;');
+
+diag "waiting for catchup";
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+diag "caught up, waiting for client";
+
+# client dies?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+# make sure we see the effect promptly
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2);
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, 'xmin on phys slot non-null after re-establishing hot standby feedback');
+ok($catalog_xmin, 'catalog_xmin on phys slot non-null after re-establishing hot standby feedback')
+	or BAIL_OUT('further results meaningless if catalog_xmin not set on master');
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_xmin, $new_catalog_xmin) = print_phys_xmin();
+# We're now back to the old behaviour of hot_standby_feedback
+# reporting nextXid for both thresholds
+ok($new_catalog_xmin, "physical catalog_xmin still non-null");
+cmp_ok($new_catalog_xmin, '>', $catalog_xmin,
+	'catalog_xmin increased after slot drop');
+cmp_ok($new_catalog_xmin, 'eq', $new_xmin,
+	'xmin and catalog_xmin equal after slot drop');
+
+
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+diag "Testing dropdb when downstream slot is not in-use";
+diag "creating slot dodropslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot')
+	or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+diag "creating slot otherslot";
+$start_time = [Time::HiRes::gettimeofday()];
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot')
+	or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+diag sprintf("Creation took %.2f seconds", Time::HiRes::tv_interval($start_time));
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+diag "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+diag "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot']);
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+diag "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+diag "pg_recvlogical backend pid is " . $node_replica->slot('dodropslot2')->{'active_pid'};
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	diag "waiting for walsender to exit";
+}
+
+diag "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#97Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#95)
Re: Logical decoding on standby

On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

- Andres


#98Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#97)
3 attachment(s)
Re: Logical decoding on standby

On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

I've just fixed an issue where hot_standby_feedback on a physical slot
could cause oldestCatalogXmin to go backwards. When the slot's
catalog_xmin was 0 and was being set for the first time, the standby's
supplied catalog_xmin was trusted as-is. To fix it,
PhysicalReplicationSlotNewXmin now clamps the value to the master's
GetOldestSafeDecodingTransactionId() when setting catalog_xmin from 0.
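
The clamp can be illustrated with a minimal standalone sketch. This is
not the patch code: clamp_initial_catalog_xmin() and xid_precedes() are
hypothetical helpers invented for illustration, there is no locking, and
oldest_safe merely stands in for whatever
GetOldestSafeDecodingTransactionId() would return on the master.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/*
 * Wraparound-aware "a is older than b" for normal xids: compare the
 * modulo-2^32 difference as a signed value. Hypothetical stand-in for
 * TransactionIdPrecedes(); special xids are not handled here.
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: decide which catalog_xmin to store when the
 * standby sends feedback. If the slot has no catalog_xmin yet
 * (current == 0) and the standby's value is older than what the master
 * can still guarantee, use the master's bound instead of trusting the
 * standby. In every other case the feedback value is returned unchanged.
 */
static TransactionId
clamp_initial_catalog_xmin(TransactionId current,
						   TransactionId feedback,
						   TransactionId oldest_safe)
{
	/* assumption: the master's bound is always a valid xid */
	assert(oldest_safe != InvalidTransactionId);

	if (current == InvalidTransactionId &&
		feedback != InvalidTransactionId &&
		xid_precedes(feedback, oldest_safe))
		return oldest_safe;

	return feedback;
}
```

For example, with no catalog_xmin set yet, a lagging standby reporting
90 while the master's bound is 100 ends up with 100, whereas a report of
150 is kept as is.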

Tests are cleaned up and fixed.

This series adds full support for logical decoding on a standby.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

01-log-catalog-xmin-advances-v7.patch (text/x-patch; charset=US-ASCII)
From 24e2baea15c4f435789c7fda5ddc9feae8a7012f Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 22 Mar 2017 13:36:49 +0800
Subject: [PATCH 1/3] Log catalog_xmin advances before removing catalog tuples

Before advancing the effective catalog_xmin we use to remove old catalog
tuple versions, make sure it is written to WAL. This allows standbys
to know the oldest xid they can safely create a historic snapshot for.
They can then refuse to start decoding from a slot or raise a recovery
conflict.

The catalog_xmin advance is logged in a new xl_xact_catalog_xmin_advance record,
emitted before vacuum or periodically by the bgwriter. WAL is only written if
the lowest catalog_xmin needed by any replication slot has advanced.
---
 src/backend/access/heap/rewriteheap.c       |   3 +-
 src/backend/access/rmgrdesc/xactdesc.c      |   9 ++
 src/backend/access/rmgrdesc/xlogdesc.c      |   3 +-
 src/backend/access/transam/varsup.c         |  15 ++++
 src/backend/access/transam/xact.c           |  36 ++++++++
 src/backend/access/transam/xlog.c           |  23 ++++-
 src/backend/postmaster/bgwriter.c           |   9 ++
 src/backend/replication/logical/decode.c    |  12 +++
 src/backend/replication/walreceiver.c       |   2 +-
 src/backend/replication/walsender.c         |  46 +++++++++-
 src/backend/storage/ipc/procarray.c         | 134 ++++++++++++++++++++++++++--
 src/bin/pg_controldata/pg_controldata.c     |   2 +
 src/include/access/transam.h                |   5 ++
 src/include/access/xact.h                   |  12 ++-
 src/include/catalog/pg_control.h            |   1 +
 src/include/storage/procarray.h             |   5 +-
 src/test/recovery/t/006_logical_decoding.pl |  90 +++++++++++++++++--
 17 files changed, 383 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..d1400ec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
 	if (!state->rs_logical_rewrite)
 		return;
 
-	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
+	/* Use oldestCatalogXmin here */
+	ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
 
 	/*
 	 * If there are no logical slots in progress we don't need to do anything,
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..96ea163 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -297,6 +297,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		appendStringInfo(buf, "catalog_xmin %u", xlrec->new_catalog_xmin);
+	}
 }
 
 const char *
@@ -324,6 +330,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+			id = "CATALOG_XMIN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..a66cfc6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -47,7 +47,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 						 "oldest/newest commit timestamp xid: %u/%u; "
-						 "oldest running xid %u; %s",
+						 "oldest running xid %u; oldest catalog xmin %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -63,6 +63,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
+						 checkpoint->oldestCatalogXmin,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5efbfbd..ffabf1c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -414,6 +414,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
 	}
 }
 
+/*
+ * Set the global oldest catalog_xmin used to determine when tuples
+ * may be removed from catalogs and user-catalogs accessible from logical
+ * decoding.
+ *
+ * Only to be called from the startup process or from LogCurrentRunningXacts()
+ * which ensures the update is properly written to xlog first.
+ */
+void
+SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
+{
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCatalogXmin = oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+}
 
 /*
  * ForceTransactionIdLimitUpdate -- does the XID wrap-limit data need updating?
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..63453d7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5652,6 +5652,42 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
+	{
+		xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
+
+		/*
+		 * Apply the new catalog_xmin limit immediately. New decoding sessions
+		 * will refuse to start if their slot is past it, and old ones will
+		 * notice when we signal them with a recovery conflict. There's no
+		 * effect on the catalogs themselves yet, so it's safe for backends
+		 * with older catalog_xmins to still exist.
+		 *
+		 * We don't have to take ProcArrayLock since only the startup process
+		 * is allowed to change oldestCatalogXmin when we're in recovery.
+		 *
+		 * Existing sessions are not notified and must check the safe xmin.
+		 */
+		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
+
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * Record when we advance the catalog_xmin used for tuple removal
+ * so standbys find out before we remove catalog tuples they might
+ * need for logical decoding.
+ */
+XLogRecPtr
+XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
+{
+	xl_xact_catalog_xmin_advance xlrec;
+
+	xlrec.new_catalog_xmin = new_catalog_xmin;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
+	return XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..8d713e9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5021,6 +5021,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
@@ -6611,6 +6612,12 @@ StartupXLOG(void)
 	   (errmsg_internal("oldest unfrozen transaction ID: %u, in database %u",
 						checkPoint.oldestXid, checkPoint.oldestXidDB)));
 	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
+			(errmsg_internal("oldest catalog-only transaction ID: %u",
+							 checkPoint.oldestCatalogXmin)));
+	ereport(DEBUG1,
 			(errmsg_internal("oldest MultiXactId: %u, in database %u",
 						 checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
 	ereport(DEBUG1,
@@ -6628,6 +6635,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+	SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
 	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
 					 checkPoint.newestCommitTsXid);
@@ -8537,6 +8545,9 @@ CreateCheckPoint(int flags)
 	 */
 	InitXLogInsert();
 
+	/* Checkpoints are a handy time to update the effective catalog_xmin */
+	UpdateOldestCatalogXmin();
+
 	/*
 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
 	 * (This is just pro forma, since in the present system structure there is
@@ -8726,6 +8737,10 @@ CreateCheckPoint(int flags)
 							 &checkPoint.oldestMulti,
 							 &checkPoint.oldestMultiDB);
 
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	checkPoint.oldestCatalogXmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
 	/*
 	 * Having constructed the checkpoint record, ensure all shmem disk buffers
 	 * and commit-log buffers are flushed to disk.
@@ -9632,6 +9647,8 @@ xlog_redo(XLogReaderState *record)
 		 */
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
 		 * record, the backup was canceled and the end-of-backup record will
@@ -9729,8 +9746,10 @@ xlog_redo(XLogReaderState *record)
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
-			SetTransactionIdLimit(checkPoint.oldestXid,
-								  checkPoint.oldestXidDB);
+			SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+
+		SetOldestCatalogXmin(checkPoint.oldestCatalogXmin);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf2..3bb5200 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -51,6 +51,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
@@ -333,6 +334,14 @@ BackgroundWriterMain(void)
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
 			}
+
+			/*
+			 * We can also advance the threshold used for catalog tuple
+			 * cleanup, rate-limited so we don't write it too often. The delay
+			 * slightly increases catalog bloat but reduces the volume of
+			 * catalog_xmin advance records written.
+			 */
+			UpdateOldestCatalogXmin();
 		}
 
 		/*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b5084b9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -288,6 +288,18 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
 			break;
+		case XLOG_XACT_CATALOG_XMIN_ADV:
+
+			/*
+			 * The global catalog_xmin has been advanced. By the time we see
+			 * this in logical decoding it no longer matters, since it's
+			 * guaranteed that all later records will be consistent with the
+			 * advanced catalog_xmin, so we ignore it here. If we were running
+			 * on a standby and it applied a catalog xmin advance past our
+			 * needed catalog_xmin we would've already been terminated with a
+			 * conflict with standby error.
+			 */
+			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index df93265..277f196 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1234,7 +1234,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 		xmin = GetOldestXmin(NULL,
 							 PROCARRAY_FLAGS_DEFAULT|PROCARRAY_SLOTS_XMIN);
 
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
+		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL, &catalog_xmin);
 
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dbb10c7..e64054b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1778,15 +1778,55 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 		slot->data.xmin = feedbackXmin;
 		slot->effective_xmin = feedbackXmin;
 	}
+	/*
+	 * If the physical slot is relaying catalog_xmin for logical replication
+	 * slots on the replica it's safe to act on catalog_xmin advances
+	 * immediately too. The replica will only send a new catalog_xmin via
+	 * feedback when it advances its effective_catalog_xmin, so it's done the
+	 * delay-until-confirmed dance for us and knows it won't need the data
+	 * we're protecting from vacuum again.
+	 */
 	if (!TransactionIdIsNormal(slot->data.catalog_xmin) ||
 		!TransactionIdIsNormal(feedbackCatalogXmin) ||
 		TransactionIdPrecedes(slot->data.catalog_xmin, feedbackCatalogXmin))
 	{
+		/*
+		 * If the standby is setting a catalog_xmin for the first time we must
+		 * check that it's within our global xmin horizon so we don't lock in a
+		 * value we might've already removed tuples for. The standby might have
+		 * an outdated catalog_xmin locally if it's lagging and we can't blindly
+		 * trust it, since we'd then update oldestCatalogXmin with a value that's
+		 * not actually safe.
+		 */
+		if (TransactionIdIsValid(feedbackCatalogXmin) &&
+			!TransactionIdIsValid(slot->effective_catalog_xmin))
+		{
+			TransactionId lowerBound;
+
+			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+			lowerBound = GetOldestSafeDecodingTransactionId();
+			if (TransactionIdPrecedes(feedbackCatalogXmin, lowerBound))
+				feedbackCatalogXmin = lowerBound;
+
+			slot->effective_catalog_xmin = feedbackCatalogXmin;
+			slot->data.catalog_xmin = slot->effective_catalog_xmin;
+
+			SpinLockRelease(&slot->mutex);
+			ReplicationSlotsComputeRequiredXmin(true);
+
+			LWLockRelease(ProcArrayLock);
+		}
+		else
+		{
+			slot->data.catalog_xmin = feedbackCatalogXmin;
+			slot->effective_catalog_xmin = feedbackCatalogXmin;
+			SpinLockRelease(&slot->mutex);
+		}
 		changed = true;
-		slot->data.catalog_xmin = feedbackCatalogXmin;
-		slot->effective_catalog_xmin = feedbackCatalogXmin;
 	}
-	SpinLockRelease(&slot->mutex);
+	else
+		SpinLockRelease(&slot->mutex);
 
 	if (changed)
 	{
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7c2e1e1..9e98af8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -87,7 +87,12 @@ typedef struct ProcArrayStruct
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
-	/* oldest catalog xmin of any replication slot */
+
+	/*
+	 * Oldest catalog xmin of any replication slot
+	 *
+	 * See also ShmemVariableCache->oldestCatalogXmin
+	 */
 	TransactionId replication_slot_catalog_xmin;
 
 	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
@@ -1306,6 +1311,9 @@ TransactionIdIsActive(TransactionId xid)
  * The return value is also adjusted with vacuum_defer_cleanup_age, so
  * increasing that setting on the fly is another easy way to make
  * GetOldestXmin() move backwards, with no consequences for data integrity.
+ *
+ * When changing GetOldestXmin, check to see whether RecentGlobalXmin
+ * computation in GetSnapshotData also needs changing.
  */
 TransactionId
 GetOldestXmin(Relation rel, int flags)
@@ -1444,6 +1452,89 @@ GetOldestXmin(Relation rel, int flags)
 }
 
 /*
+ * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
+ * to reflect an advance in procArray->replication_slot_catalog_xmin or
+ * it becoming newly set or unset.
+ *
+ */
+static bool
+CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
+{
+	return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
+			|| (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
+}
+
+/*
+ * If necessary, copy the current catalog_xmin needed by replication slots to
+ * the effective catalog_xmin used for dead tuple removal and write a WAL
+ * record recording the change.
+ *
+ * This allows standbys to know the oldest xid for which it is safe to create
+ * a historic snapshot for logical decoding. VACUUM or other cleanup may have
+ * removed catalog tuple versions needed to correctly decode transactions older
+ * than this threshold. Standbys can use this information to cancel conflicting
+ * decoding sessions and invalidate slots that need discarded information.
+ *
+ * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
+ * this, since they don't identify the relation as a catalog or not.  Nor can a
+ * standby look up the relcache to get the Relation for the affected
+ * relfilenode to check if it is a catalog. The standby would also have no way
+ * to know the oldest safe position at startup if it wasn't in the control
+ * file.)
+ */
+void
+UpdateOldestCatalogXmin(void)
+{
+	TransactionId vacuum_catalog_xmin;
+	TransactionId slots_catalog_xmin;
+
+	Assert(XLogInsertAllowed());
+
+	/*
+	 * It's most likely that replication_slot_catalog_xmin and
+	 * oldestCatalogXmin will be the same and no action is required, so do a
+	 * pre-check before doing expensive WAL writing and exclusive locking.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	vacuum_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+	slots_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	LWLockRelease(ProcArrayLock);
+
+	if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+	{
+		/*
+		 * We must prevent a concurrent checkpoint, otherwise the catalog xmin
+		 * advance xlog record with the new value might be written before the
+		 * checkpoint but the checkpoint may still see the old
+		 * oldestCatalogXmin value.
+		 */
+		if (!LWLockConditionalAcquire(CheckpointLock, LW_SHARED))
+			/* Couldn't get checkpointer lock; will retry later */
+			return;
+
+		XactLogCatalogXminUpdate(slots_catalog_xmin);
+
+		/*
+		 * A concurrent updater could've changed the oldestCatalogXmin so we
+		 * need to re-check under ProcArrayLock before updating. The LWLock
+		 * provides a barrier.
+		 *
+		 * We must not re-read replication_slot_catalog_xmin even if it has
+		 * advanced, since we xlog'd the older value. If it advanced since, a
+		 * later run will xlog the new value and advance.
+		 */
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		vacuum_catalog_xmin = *((volatile TransactionId *) &ShmemVariableCache->oldestCatalogXmin);
+		if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
+			ShmemVariableCache->oldestCatalogXmin = slots_catalog_xmin;
+		LWLockRelease(ProcArrayLock);
+
+		LWLockRelease(CheckpointLock);
+	}
+
+}
+
+/*
  * GetMaxSnapshotXidCount -- get max size for snapshot XID array
  *
  * We have to export this for use by snapmgr.c.
@@ -1700,7 +1791,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/* fetch into volatile var while ProcArrayLock is held */
 	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	replication_slot_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
 
 	if (!TransactionIdIsValid(MyPgXact->xmin))
 		MyPgXact->xmin = TransactionXmin = xmin;
@@ -1711,6 +1802,9 @@ GetSnapshotData(Snapshot snapshot)
 	 * Update globalxmin to include actual process xids.  This is a slightly
 	 * different way of computing it than GetOldestXmin uses, but should give
 	 * the same result.
+	 *
+	 * If you change computation of RecentGlobalXmin here you may need to
+	 * change GetOldestXmin(...) as well.
 	 */
 	if (TransactionIdPrecedes(xmin, globalxmin))
 		globalxmin = xmin;
@@ -2041,12 +2135,16 @@ GetRunningTransactionData(void)
 	}
 
 	/*
-	 * It's important *not* to include the limits set by slots here because
+	 * It's important *not* to include the xmin set by slots here because
 	 * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
 	 * were to be included here the initial value could never increase because
-	 * of a circular dependency where slots only increase their limits when
-	 * running xacts increases oldestRunningXid and running xacts only
+	 * of a circular dependency where slots only increase their xmin limits
+	 * when running xacts increases oldestRunningXid and running xacts only
 	 * increases if slots do.
+	 *
+	 * We can safely report the catalog_xmin limit for replication slots here
+	 * because it's only used to advance oldestCatalogXmin. Slots'
+	 * catalog_xmin advance does not depend on it so there's no circularity.
 	 */
 
 	CurrentRunningXacts->xcnt = count - subcount;
@@ -2171,6 +2269,13 @@ GetOldestSafeDecodingTransactionId(void)
 	 * If there's already a slot pegging the xmin horizon, we can start with
 	 * that value, it's guaranteed to be safe since it's computed by this
 	 * routine initially and has been enforced since.
+	 *
+	 * We don't use ShmemVariableCache->oldestCatalogXmin here because another
+	 * backend may have already logged its intention to advance it to a higher
+	 * value (still <= replication_slot_catalog_xmin) and just be waiting on
+	 * ProcArrayLock to actually apply the change. On a standby
+	 * replication_slot_catalog_xmin is what the walreceiver will be sending
+	 * in hot_standby_feedback, not oldestCatalogXmin.
 	 */
 	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
 		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
@@ -2965,18 +3070,31 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  *
  * Return the current slot xmin limits. That's useful to be able to remove
  * data that's older than those limits.
+ *
+ * For logical replication slots' catalog_xmins, we return both the effective
+ * catalog_xmin being used for tuple removal (retained catalog_xmin) and the
+ * catalog_xmin actually needed by replication slots (needed_catalog_xmin).
+ *
+ * retained_catalog_xmin should be no newer than needed_catalog_xmin but is not
+ * guaranteed to be if there are replication slots on a replica currently
+ * attempting to start up and reserve catalogs, outdated replicas sending
+ * feedback, etc.
  */
 void
 ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin)
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin)
 {
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	if (xmin != NULL)
 		*xmin = procArray->replication_slot_xmin;
 
-	if (catalog_xmin != NULL)
-		*catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (retained_catalog_xmin != NULL)
+		*retained_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+
+	if (needed_catalog_xmin != NULL)
+		*needed_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..5c7eb77 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -248,6 +248,8 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("Latest checkpoint's oldestCatalogXmin:%u\n"),
+		   ControlFile->checkPointCopy.oldestCatalogXmin);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index d25a2dd..c2cb0a1 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -134,6 +134,10 @@ typedef struct VariableCacheData
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
+	TransactionId oldestCatalogXmin;	/* oldestCatalogXmin guarantees that
+										 * no valid catalog tuples >= it
+										 * are removed. That property is used
+										 * for logical decoding. */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -179,6 +183,7 @@ extern TransactionId GetNewTransactionId(bool isSubXact);
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Oid oldest_datoid);
+extern void SetOldestCatalogXmin(TransactionId oldestCatalogXmin);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..6d18d18 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -137,7 +137,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_CATALOG_XMIN_ADV	0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -187,6 +187,13 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+typedef struct xl_xact_catalog_xmin_advance
+{
+	TransactionId new_catalog_xmin;
+}	xl_xact_catalog_xmin_advance;
+
+#define SizeOfXactCatalogXminAdvance (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -391,6 +398,9 @@ extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
 				   int nsubxacts, TransactionId *subxacts,
 				   int nrels, RelFileNode *rels,
 				   int xactflags, TransactionId twophase_xid);
+
+extern XLogRecPtr XactLogCatalogXminUpdate(TransactionId new_catalog_xmin);
+
 extern void xact_redo(XLogReaderState *record);
 
 /* xactdesc.c */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..1fe89ae 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -45,6 +45,7 @@ typedef struct CheckPoint
 	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
+	TransactionId oldestCatalogXmin;	/* catalog retained after this xid */
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49..69a82d7 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -120,6 +120,9 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 							TransactionId catalog_xmin, bool already_locked);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
-								TransactionId *catalog_xmin);
+								TransactionId *retained_catalog_xmin,
+								TransactionId *needed_catalog_xmin);
+
+extern void UpdateOldestCatalogXmin(void);
 
 #endif   /* PROCARRAY_H */
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index bf9b50a..80b976b 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,24 +7,79 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 44;
 
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
-$node_master->append_conf(
-		'postgresql.conf', qq(
+$node_master->append_conf('postgresql.conf', qq(
 wal_level = logical
+hot_standby_feedback = on
+wal_receiver_status_interval = 1
+log_min_messages = debug1
 ));
 $node_master->start;
-my $backup_name = 'master_backup';
 
+# Set up some changes before we make base backups
 $node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
 
 $node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
 
 $node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]);
 
+# Launch two streaming replicas, one with and one without
+# physical replication slots. We'll use these for tests
+# involving interaction of logical and physical standby.
+#
+# Both backups are created with pg_basebackup.
+#
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+$node_master->safe_psql('postgres', q[SELECT pg_create_physical_replication_slot('slot_replica');]);
+my $node_slot_replica = get_new_node('slot_replica');
+$node_slot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_slot_replica->append_conf('recovery.conf', "primary_slot_name = 'slot_replica'");
+
+my $node_noslot_replica = get_new_node('noslot_replica');
+$node_noslot_replica->init_from_backup($node_master, $backup_name, has_streaming => 1);
+
+$node_slot_replica->start;
+$node_noslot_replica->start;
+
+sub restartpoint_standbys
+{
+	# Force restartpoints to update control files on replicas
+	$node_slot_replica->safe_psql('postgres', 'CHECKPOINT');
+	$node_noslot_replica->safe_psql('postgres', 'CHECKPOINT');
+}
+
+sub wait_standbys
+{
+	my $lsn = $node_master->lsn('insert');
+	$node_master->wait_for_catchup($node_noslot_replica, 'replay', $lsn);
+	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
+}
+
+# pg_basebackup doesn't copy replication slots
+is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
+	'logical slot test_slot on master not copied by pg_basebackup');
+
+# Make sure oldestCatalogXmin lands in the control file on master
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
+
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero after start on " . $node->name);
+}
+
 # Basic decoding works
 my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
 is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
@@ -64,6 +119,9 @@ $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpo
 chomp($stdout_recv);
 is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
 
+# Create a second DB we'll use for testing dropping and accessing slots across
+# databases. This matters since logical slots are globally visible objects that
+# can only actually be used on one DB for most purposes.
 $node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
 
 is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
@@ -96,9 +154,29 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
+		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
+}
+
+# Dropping the slot must clear catalog_xmin
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
 is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
+$node_master->safe_psql('postgres', 'VACUUM;');
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+wait_standbys();
+restartpoint_standbys();
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+}
 
-# done with the node
-$node_master->stop;
+foreach my $node (@nodes)
+{
+	$node->stop;
+}
-- 
2.5.5
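
The update test in CatalogXminNeedsUpdate() above is compact, so here is a standalone sketch of what it decides, for anyone reading along without a source tree. It is not part of the patch. The names xid_is_valid, xid_precedes and catalog_xmin_needs_update are hypothetical stand-ins; the comparison rules are restated from my understanding of TransactionIdIsValid() and TransactionIdPrecedes() in PostgreSQL (xid 0 is invalid, xids below 3 are special and compared as plain integers, normal xids are compared modulo 2^32).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 32-bit xids, as in PostgreSQL's TransactionId. */
typedef uint32_t TransactionId;

#define INVALID_XID       ((TransactionId) 0)	/* "not set" */
#define FIRST_NORMAL_XID  ((TransactionId) 3)	/* xids below this are special */

/* Hypothetical stand-in for TransactionIdIsValid(). */
bool
xid_is_valid(TransactionId xid)
{
	return xid != INVALID_XID;
}

/*
 * Hypothetical stand-in for TransactionIdPrecedes(): is id1 logically
 * older than id2?  Special xids compare as plain integers; normal xids
 * compare on the signed 32-bit difference, so the answer survives
 * wraparound.  (The conversion to int32_t relies on the usual
 * two's-complement behaviour of common compilers.)
 */
bool
xid_precedes(TransactionId id1, TransactionId id2)
{
	if (id1 < FIRST_NORMAL_XID || id2 < FIRST_NORMAL_XID)
		return id1 < id2;

	return (int32_t) (id1 - id2) < 0;
}

/*
 * Same shape as CatalogXminNeedsUpdate() in the patch: an update is
 * needed if the slots' catalog_xmin has advanced past the value used
 * for tuple removal, or if exactly one of the two values is unset.
 */
bool
catalog_xmin_needs_update(TransactionId vacuum_catalog_xmin,
						  TransactionId slots_catalog_xmin)
{
	return xid_precedes(vacuum_catalog_xmin, slots_catalog_xmin)
		|| (xid_is_valid(vacuum_catalog_xmin) != xid_is_valid(slots_catalog_xmin));
}
```

Under these assumptions the function reports an update when the first slot is created (0 vs 100), when the last slot is dropped (100 vs 0), and when the slots' value advances (100 vs 150, including across wraparound). It reports none when the two values are equal or when the slots' value is behind the retained one (150 vs 100), which matches the intent that oldestCatalogXmin never moves backwards while slots exist.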

Attachment: 02-decoding-recovery-conflicts-v7.patch (text/x-patch; charset=US-ASCII)
From 9036702eb645acaf3ec660d511c62b09b816f73e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 3 Apr 2017 17:31:19 +0800
Subject: [PATCH 2/3] Support conflict with standby on logical walsender

Detect and resolve conflicts between catalog_xmin advances and
walsenders or SQL-level logical decoding sessions. Also refuse to
start decoding from a logical slot whose catalog_xmin is below
the cluster-wide known-safe threshold, so that new sessions cannot
start on a slot that is already unusable.

Slots are not persistently marked as invalid and will continue to hold down
xlog and (on master) catalog retention. There is no way to restore them to
working order, so the application or administrator must drop them
to release resources.
---
 src/backend/access/heap/heapam.c          |   2 +-
 src/backend/access/transam/xact.c         |   6 +-
 src/backend/access/transam/xlog.c         |   3 +-
 src/backend/replication/logical/logical.c | 135 ++++++++++++++++++++++++++++++
 src/backend/replication/slot.c            |   4 +-
 src/backend/replication/walsender.c       |  14 +---
 src/backend/storage/ipc/procarray.c       |  51 +++++++++++
 src/backend/storage/ipc/procsignal.c      |   3 +
 src/backend/storage/ipc/standby.c         |   7 +-
 src/backend/tcop/postgres.c               |  43 +++++++---
 src/include/storage/procarray.h           |   2 +
 src/include/storage/procsignal.h          |   1 +
 src/include/storage/standby.h             |   3 +
 13 files changed, 241 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..93bf143 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7273,7 +7273,7 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
  * ratchet forwards latestRemovedXid to the greatest one found.
  * This is used as the basis for generating Hot Standby conflicts, so
  * if a tuple was never visible then removing it should not conflict
- * with queries.
+ * with queries or logical decoding output plugin callbacks.
  */
 void
 HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 63453d7..48ca884 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5662,14 +5662,10 @@ xact_redo(XLogReaderState *record)
 		 * notice when we signal them with a recovery conflict. There's no
 		 * effect on the catalogs themselves yet, so it's safe for backends
 		 * with older catalog_xmins to still exist.
-		 *
-		 * We don't have to take ProcArrayLock since only the startup process
-		 * is allowed to change oldestCatalogXmin when we're in recovery.
-		 *
-		 * Existing sessions are not notified and must check the safe xmin.
 		 */
 		SetOldestCatalogXmin(xlrec->new_catalog_xmin);
 
+		ResolveRecoveryConflictWithLogicalDecoding(xlrec->new_catalog_xmin);
 	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d713e9..a98601a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8546,7 +8546,8 @@ CreateCheckPoint(int flags)
 	InitXLogInsert();
 
 	/* Checkpoints are a handy time to update the effective catalog_xmin */
-	UpdateOldestCatalogXmin();
+	if (XLogInsertAllowed())
+		UpdateOldestCatalogXmin();
 
 	/*
 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..4a15d55 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "pgstat.h"
 
 #include "access/xact.h"
 #include "access/xlog_internal.h"
@@ -38,11 +39,14 @@
 #include "replication/reorderbuffer.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/walreceiver.h"
 
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 
 #include "utils/memutils.h"
+#include "utils/ps_status.h"
 
 /* data for errcontext callback */
 typedef struct LogicalErrorCallbackState
@@ -68,6 +72,8 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
+static void EnsureActiveLogicalSlotValid(void);
+
 /*
  * Make sure the current settings & environment are capable of doing logical
  * decoding.
@@ -279,6 +285,16 @@ CreateInitDecodingContext(char *plugin,
 	LWLockRelease(ProcArrayLock);
 
 	/*
+	 * If this is the first slot created on the master we won't have a
+	 * persistent record of the oldest safe xid for historic snapshots yet.
+	 * Force one to be recorded so that when we go to replay from this slot we
+	 * know it's safe.
+	 */
+	if (!RecoveryInProgress() &&
+		!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+		UpdateOldestCatalogXmin();
+
+	/*
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
 	 */
@@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	EnsureActiveLogicalSlotValid();
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId,
 								 read_page, prepare_write, do_write);
@@ -963,3 +981,120 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Test to see if the active logical slot is usable.
+ */
+static void
+EnsureActiveLogicalSlotValid(void)
+{
+	TransactionId shmem_catalog_xmin;
+
+	Assert(MyReplicationSlot != NULL);
+
+	/*
+	 * A logical slot can become unusable if we're doing logical decoding on a
+	 * standby or using a slot created before we were promoted from standby
+	 * to master. If the master advanced its global catalog_xmin past the
+	 * threshold we need it could've removed catalog tuple versions that
+	 * we'll require to start decoding at our restart_lsn.
+	 */
+
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
+	LWLockRelease(ProcArrayLock);
+
+	if (!TransactionIdIsValid(shmem_catalog_xmin) ||
+		TransactionIdFollows(shmem_catalog_xmin, MyReplicationSlot->data.catalog_xmin))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("replication slot '%s' requires catalogs removed by master",
+						NameStr(MyReplicationSlot->data.name)),
+				 errdetail("need catalog_xmin %u, have oldestCatalogXmin %u",
+						   MyReplicationSlot->data.catalog_xmin, shmem_catalog_xmin)));
+}
+
+/*
+ * Scan to see if any clients are using replication slots that are below a
+ * newly-applied catalog_xmin threshold and signal them to terminate with a
+ * recovery conflict.
+ */
+void
+ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
+{
+	int i;
+
+	if (!InHotStandby)
+		/* nobody can be actively using logical slots */
+		return;
+
+	/* Already applied new limit, can't have replayed later one yet */
+	Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
+
+	/*
+	 * Find the first conflicting active slot and signal its owning backend
+	 * to exit. We'll be called repeatedly by the recovery code until there
+	 * are no more conflicts.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *slot;
+		pid_t active_pid;
+
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Physical slots can have a catalog_xmin, but conflicts are the
+		 * problem of the leaf replica with the logical slot.
+		 */
+		if (!(slot->in_use && SlotIsLogical(slot)))
+			continue;
+
+		/*
+		 * We only care about the effective_catalog_xmin of active logical
+		 * slots. Anything else gets checked when a new decoding session tries
+		 * to start.
+		 */
+		 while (slot->in_use && slot->active_pid != 0 &&
+				TransactionIdIsValid(slot->effective_catalog_xmin) &&
+				(!TransactionIdIsValid(new_catalog_xmin) ||
+				 TransactionIdPrecedes(slot->effective_catalog_xmin, new_catalog_xmin)))
+		{
+			/*
+			 * We'll be sleeping, so release the control lock. New conflicting
+			 * backends cannot appear and if old ones go away that's what we
+			 * want, so release and re-acquire is OK here.
+			 */
+			active_pid = slot->active_pid;
+			LWLockRelease(ReplicationSlotControlLock);
+
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				ereport(INFO,
+						(errmsg("terminating logical decoding session due to recovery conflict"),
+						 errdetail("Pid %u requires catalog_xmin %u for replication slot '%s' but the master has removed catalogs up to xid %u.",
+								   active_pid, slot->effective_catalog_xmin,
+								   NameStr(slot->data.name), new_catalog_xmin)));
+
+				/*
+				 * Signal the proc. If the slot is already released or even if
+				 * pid is re-used we don't care, backends are required to
+				 * tolerate spurious recovery signals.
+				 */
+				CancelLogicalDecodingSessionWithRecoveryConflict(active_pid);
+
+				/* Don't flood the system with signals */
+				pg_usleep(10000);
+			}
+
+			/*
+			 * We need to re-acquire the lock before re-checking the slot or
+			 * continuing the scan.
+			 */
+			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+		}
+
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 6c5ec7a..57a3994 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -48,6 +48,7 @@
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/standby.h"
 #include "utils/builtins.h"
 
 /*
@@ -931,7 +932,8 @@ ReplicationSlotReserveWal(void)
 		/*
 		 * For logical slots log a standby snapshot and start logical decoding
 		 * at exactly that position. That allows the slot to start up more
-		 * quickly.
+		 * quickly. We can't do that on a standby; there we must wait for the
+		 * bgwriter to get around to logging its periodic standby snapshot.
 		 *
 		 * That's not needed (or indeed helpful) for physical slots as they'll
 		 * start replay at the last logged checkpoint anyway. Instead return
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e64054b..5d60e7a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -212,7 +212,6 @@ static struct
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -2863,17 +2862,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2907,7 +2895,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 9e98af8..05e3058 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2762,6 +2762,57 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 }
 
 /*
+ * Notify a logical decoding session that it conflicts with newly set
+ * catalog_xmin from the master. We're about to start replaying WAL
+ * that will make its historic snapshot potentially unsafe by removing
+ * system tuples it might need.
+ */
+void
+CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
+{
+	ProcArrayStruct *arrayP = procArray;
+	int			index;
+	BackendId	backend_id = InvalidBackendId;
+
+	/*
+	 * We have to scan ProcArray to find the process and set a pending recovery
+	 * conflict even though we know the pid. At least we can get the BackendId
+	 * and avoid a ProcSignal scan by SendProcSignal.
+	 *
+	 * The pid might've gone away, in which case we got the desired
+	 * outcome anyway.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	for (index = 0; index < arrayP->numProcs; index++)
+	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		volatile PGPROC *proc = &allProcs[pgprocno];
+
+		if (proc->pid == session_pid)
+		{
+			VirtualTransactionId procvxid;
+
+			GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+			proc->recoveryConflictPending = true;
+			backend_id = procvxid.backendId;
+			break;
+		}
+	}
+
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Kill the pid if it's still here. If not, that's what we
+	 * wanted so ignore any errors.
+	 */
+	if (backend_id != InvalidBackendId)
+		(void) SendProcSignal(session_pid,
+			PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, backend_id);
+}
+
+/*
  * MinimumActiveBackends --- count backends (other than myself) that are
  *		in active transactions.  Return true if the count exceeds the
  *		minimum threshold passed.  This is used as a heuristic to decide if
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4a21d55..16c2e1f 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -273,6 +273,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
+	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN))
+		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..f6106ca 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "replication/slot.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -152,11 +153,13 @@ GetStandbyLimitTime(void)
 static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
 
 /*
- * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs and
+ * ResolveRecoveryConflictWithLogicalDecoding.
+ *
  * We wait here for a while then return. If we decide we can't wait any
  * more then we return true, if we can wait some more return false.
  */
-static bool
+bool
 WaitExceedsMaxStandbyDelay(void)
 {
 	TimestampTz ltime;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a2282058..530dcbe 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2276,6 +2276,9 @@ errdetail_recovery_conflict(void)
 		case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
 			errdetail("User transaction caused buffer deadlock with recovery.");
 			break;
+		case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
+			errdetail("Logical replication slot requires catalog rows that will be removed.");
+			break;
 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 			errdetail("User was connected to a database that must be dropped.");
 			break;
@@ -2698,8 +2701,12 @@ SigHupHandler(SIGNAL_ARGS)
 /*
  * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
  * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * and StatementCancelHandler().
+ *
+ * Called by normal user backends running during recovery. Also used by the
+ * walsender to handle recovery conflicts with logical decoding, and by
+ * background workers that call CHECK_FOR_INTERRUPTS() and respect recovery
+ * conflicts.
  */
 void
 RecoveryConflictInterrupt(ProcSignalReason reason)
@@ -2781,6 +2788,7 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 
 				/* Intentional drop through to session cancel */
 
+			case PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN:
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
@@ -2795,12 +2803,18 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
 
 		/*
-		 * All conflicts apart from database cause dynamic errors where the
-		 * command or transaction can be retried at a later point with some
-		 * potential for success. No need to reset this, since non-retryable
-		 * conflict errors are currently FATAL.
+		 * All conflicts apart from database and catalog_xmin cause dynamic
+		 * errors where the command or transaction can be retried at a later
+		 * point with some potential for success. No need to reset this, since
+		 * non-retryable conflict errors are currently FATAL.
+		 *
+		 * catalog_xmin is non-retryable because once we advance the
+		 * catalog_xmin threshold we might replay wal that removes
+		 * needed catalog tuples. The slot can't (re)start decoding
+		 * because its catalog_xmin cannot be satisfied.
 		 */
-		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
+		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+			reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
 			RecoveryConflictRetryable = false;
 	}
 
@@ -2855,11 +2869,20 @@ ProcessInterrupts(void)
 		}
 		else if (RecoveryConflictPending)
 		{
-			/* Currently there is only one non-retryable recovery conflict */
-			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
+			int code;
+
+			Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
+				   RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN);
+
+			if (RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
+				/* XXX more appropriate error code? */
+				code = ERRCODE_PROGRAM_LIMIT_EXCEEDED;
+			else
+				code = ERRCODE_DATABASE_DROPPED;
+
 			pgstat_report_recovery_conflict(RecoveryConflictReason);
 			ereport(FATAL,
-					(errcode(ERRCODE_DATABASE_DROPPED),
+					(errcode(code),
 			  errmsg("terminating connection due to conflict with recovery"),
 					 errdetail_recovery_conflict()));
 		}
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 69a82d7..231297d 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -112,6 +112,8 @@ extern int	CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
 					 int *nbackends, int *nprepared);
 
+extern void CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid);
+
 extern void XidCacheRemoveRunningXids(TransactionId xid,
 						  int nxids, const TransactionId *xids,
 						  TransactionId latestXid);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index d068dde..3a3ba72 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -40,6 +40,7 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 3ecc446..b17ba6f 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -34,10 +34,13 @@ extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag);
 extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithLogicalDecoding(
+	TransactionId new_catalog_xmin);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
 extern void StandbyLockTimeoutHandler(void);
+extern bool WaitExceedsMaxStandbyDelay(void);
 
 /*
  * Standby Rmgr (RM_STANDBY_ID)
-- 
2.5.5
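
The conflict classification that the hunks above add to RecoveryConflictInterrupt() and ProcessInterrupts() can be restated as a small standalone sketch. This is not code from the patch. The names `ConflictReason`, `SketchErrCode`, `conflict_is_retryable` and `fatal_conflict_errcode` are hypothetical stand-ins for the ProcSignalReason values and SQLSTATE codes, and the sketch only models which reasons are treated as non-retryable and which error code the FATAL report carries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PROCSIG_RECOVERY_CONFLICT_* reasons. */
typedef enum
{
	CONFLICT_DATABASE,
	CONFLICT_TABLESPACE,
	CONFLICT_LOCK,
	CONFLICT_SNAPSHOT,
	CONFLICT_BUFFERPIN,
	CONFLICT_STARTUP_DEADLOCK,
	CONFLICT_CATALOG_XMIN
} ConflictReason;

/* Hypothetical stand-ins for the two error codes chosen in the patch. */
typedef enum
{
	SKETCH_ERRCODE_DATABASE_DROPPED,
	SKETCH_ERRCODE_PROGRAM_LIMIT_EXCEEDED
} SketchErrCode;

/*
 * Only a dropped database and a catalog_xmin conflict are non-retryable:
 * once the catalog rows a slot needs may have been removed, restarting
 * the decoding session cannot succeed.
 */
bool
conflict_is_retryable(ConflictReason reason)
{
	return reason != CONFLICT_DATABASE &&
		reason != CONFLICT_CATALOG_XMIN;
}

/*
 * Error code for the FATAL report. Like the Assert in the patch, this
 * must only be reached for a non-retryable reason.
 */
SketchErrCode
fatal_conflict_errcode(ConflictReason reason)
{
	assert(!conflict_is_retryable(reason));

	if (reason == CONFLICT_CATALOG_XMIN)
		return SKETCH_ERRCODE_PROGRAM_LIMIT_EXCEEDED;
	return SKETCH_ERRCODE_DATABASE_DROPPED;
}
```

In this model a snapshot conflict stays retryable, while a catalog_xmin conflict ends the session with the placeholder error code that the patch itself flags with an XXX comment.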

Attachment: 03-decoding-on-standby-v7.patch (text/x-patch; charset=US-ASCII)
From 860beb565dba23c8c4f68d41f3c88a0e1789d12f Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 4 Apr 2017 11:50:30 +0800
Subject: [PATCH 3/3] Permit logical decoding on standby

Permit the creation of logical slots on replicas and permit replay from them.
Dropping logical slots on replicas was already supported.
---
 src/backend/replication/logical/logical.c          |  49 +-
 src/backend/replication/walreceiver.c              |  12 +
 src/test/recovery/t/006_logical_decoding.pl        |  70 ++-
 .../recovery/t/012_logical_decoding_on_replica.pl  | 506 +++++++++++++++++++++
 4 files changed, 605 insertions(+), 32 deletions(-)
 create mode 100644 src/test/recovery/t/012_logical_decoding_on_replica.pl
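
The walreceiver change in this patch boils down to one rule for the catalog_xmin reported in hot standby feedback. A minimal sketch of that rule in isolation follows. It is an illustration, not the patch's code: `feedback_catalog_xmin` and its parameters are hypothetical, `TransactionId` is redefined locally, and wraparound-aware comparison is left out because the rule only needs a validity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions (assumed for the sketch). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Pick the catalog_xmin to report upstream.
 *
 * xmin               - the xmin already chosen for feedback
 * slot_catalog_xmin  - oldest catalog_xmin required by local slots, or
 *                      InvalidTransactionId if there are none
 * wal_level_logical  - whether wal_level is 'logical'
 *
 * With no local requirement and logical WAL enabled, report xmin so the
 * upstream reserves a catalog_xmin before any logical slot is created
 * on the standby.
 */
TransactionId
feedback_catalog_xmin(TransactionId xmin,
					  TransactionId slot_catalog_xmin,
					  bool wal_level_logical)
{
	if (slot_catalog_xmin == InvalidTransactionId && wal_level_logical)
		return xmin;

	return slot_catalog_xmin;
}
```

For example, `feedback_catalog_xmin(100, InvalidTransactionId, true)` reports 100, while a standby whose local slot needs catalog rows from xid 90 keeps reporting 90.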

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4a15d55..35d110f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -93,23 +93,40 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
-	/* ----
-	 * TODO: We got to change that someday soon...
-	 *
-	 * There's basically three things missing to allow this:
-	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
-	 * ----
-	 */
 	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			   errmsg("logical decoding cannot be used while in recovery")));
+	{
+		/*----
+		 * We really want to enforce that:
+		 * - we're connected to the primary via a replication slot
+		 * - hot_standby_feedback is enabled
+		 * - the user cannot turn hot_standby_feedback off while we have
+		 *   logical slots on the standby (it's PGC_SIGHUP)
+		 * - hot_standby_feedback has actually taken effect on the master
+		 *
+		 * ... but because the walreceiver doesn't use normal GUCs and may or
+		 * may not actually be running we can't reliably enforce those
+		 * conditions yet. We also have no way of knowing when hot standby
+		 * feedback has reached the master and locked in a catalog_xmin.
+		 *
+		 * So on standbys, slot creation or decoding from a slot may fail with
+		 * a recovery conflict. But we keep track of the master's true
+		 * catalog_xmin in WAL, so we'll never attempt to decode unsafely.
+		 *
+		 * Make a best effort sanity check anyway.
+		 *---
+		 */
+		if (!hot_standby_feedback)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("logical decoding on standby requires hot_standby_feedback = on")));
+
+		LWLockAcquire(ProcArrayLock, LW_SHARED);
+		if (!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("hot_standby_feedback has not yet taken effect")));
+		LWLockRelease(ProcArrayLock);
+	}
 }
 
 /*
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 277f196..c0f6cec 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1239,6 +1239,18 @@ XLogWalRcvSendHSFeedback(bool immed)
 		if (TransactionIdIsValid(slot_xmin) &&
 			TransactionIdPrecedes(slot_xmin, xmin))
 			xmin = slot_xmin;
+
+		/*
+		 * If there's no local catalog_xmin, report it as == xmin, so that
+		 * we lock in a catalog_xmin before we need to create any logical slots
+		 * on this standby. This won't add much catalog bloat until we create
+		 * local slots and catalog_xmin starts lagging behind xmin, but it will
+		 * cause the master to start logging
+		 * xl_xact_catalog_xmin_advance records we need for logical
+		 * decoding on standby.
+		 */
+		if (!TransactionIdIsValid(catalog_xmin) && XLogLogicalInfoActive())
+			catalog_xmin = xmin;
 	}
 	else
 	{
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 80b976b..88ddf00 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 44;
+use Test::More tests => 57;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -61,18 +61,22 @@ sub wait_standbys
 	$node_master->wait_for_catchup($node_slot_replica, 'replay', $lsn);
 }
 
+sub sync_up
+{
+	$node_master->safe_psql('postgres', 'CHECKPOINT;');
+	wait_standbys();
+	restartpoint_standbys();
+	# wait out wal_receiver_status_interval so hot_standby_feedback is sent
+	sleep(1.5);
+}
+
 # pg_basebackup doesn't copy replication slots
 is($node_slot_replica->slot('test_slot')->{'slot_name'}, undef,
 	'logical slot test_slot on master not copied by pg_basebackup');
 
-# Make sure oldestCatalogXmin lands in the control file on master
-$node_master->safe_psql('postgres', 'VACUUM;');
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
 
 my @nodes = ($node_master, $node_slot_replica, $node_noslot_replica);
-
-wait_standbys();
-restartpoint_standbys();
+sync_up();
 foreach my $node (@nodes)
 {
 	# Master had an oldestCatalogXmin, so we must've inherited it via checkpoint
@@ -154,26 +158,60 @@ isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
 	'restored slot catalog_xmin is nonzero');
 is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
 	'reading from slot with wal_level < logical fails');
-wait_standbys();
-restartpoint_standbys();
+sync_up();
 foreach my $node (@nodes)
 {
 	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:[^0][\d]*$/m,
 		"pg_controldata's oldestCatalogXmin is nonzero on " . $node->name);
 }
 
-# Dropping the slot must clear catalog_xmin
+# Drop the logical slot on the master; make sure feedback from standbys continues to peg
+# catalog_xmin.
 is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
 	'can drop logical slot while wal_level = replica');
-is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
-$node_master->safe_psql('postgres', 'VACUUM;');
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-wait_standbys();
-restartpoint_standbys();
+is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped on master');
+# Do a dummy xact so we can make sure xmin will advance, and we can see that
+# catalog_xmin will advance along with it.
+my $xmin = $node_master->safe_psql('postgres', 'BEGIN; CREATE TABLE dummy_xact(blah integer); SELECT txid_current(); COMMIT;');
+
+# even though the logical slot on the upstream is dropped, master's
+# oldestCatalogXmin is held down by hot standby feedback from the replicas.
+# Since the replicas have no logical slots of their own, it should've advanced
+# to be the same as the physical slot xmin for the slot replica.
+sync_up();
+# There are no transactions on the replicas so their xmin and catalog_xmin
+# will both be nextXid.
+cmp_ok($node_master->slot('slot_replica')->{'xmin'}, "eq", $xmin + 1,
+	'xmin advanced to latest master xid on slot_replica on master');
+cmp_ok($node_master->slot('slot_replica')->{'catalog_xmin'}, "le", $xmin + 1,
+	'xmin == catalog_xmin on phys slot held down by standby catalog_xmin');
+# Control files will still contain the xid, since there won't have been another
+# checkpoint to advance the nextXid reported by feedback and write it to the
+# control file.
+foreach my $node (@nodes)
+{
+	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:$xmin$/m,
+		"pg_controldata's oldestCatalogXmin advanced after drop, vacuum and checkpoint on " . $node->name);
+}
+
+# if we turn hot_standby_feedback off on the replica that uses a slot, the
+# master should no longer have anything holding down its catalog_xmin. Even
+# though hot_standby_feedback is still enabled on the non-slot replica, it
+# cannot set the master's catalog_xmin because it has no destination slot,
+# it can only set xmin in its procarray entry.
+$node_slot_replica->safe_psql('postgres', q[ALTER SYSTEM SET hot_standby_feedback = off;]);
+# simplest way to force new hot standby feedback to be sent
+$node_slot_replica->restart;
+sleep(1);
+# hot standby feedback should've cleared minimums
+is($node_master->slot('slot_replica')->{'xmin'}, '', 'phys slot xmin null with hs_feedback off');
+is($node_master->slot('slot_replica')->{'catalog_xmin'}, '', 'phys slot catalog_xmin null with hs_feedback off');
+sync_up();
+# Everyone should now see the cleared catalog_xmin
 foreach my $node (@nodes)
 {
 	command_like(['pg_controldata', $node->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
-		"pg_controldata's oldestCatalogXmin is zero after drop, vacuum and checkpoint on " . $node->name);
+		"pg_controldata's oldestCatalogXmin zero after turning off hs_feedback: " . $node->name);
 }
 
 foreach my $node (@nodes)
diff --git a/src/test/recovery/t/012_logical_decoding_on_replica.pl b/src/test/recovery/t/012_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..962c801
--- /dev/null
+++ b/src/test/recovery/t/012_logical_decoding_on_replica.pl
@@ -0,0 +1,506 @@
+#!/usr/bin/env perl
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 77;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');
+
+open(my $fh, "<", $backup_dir . "/recovery.conf")
+  or die "can't open recovery.conf";
+
+my $found = 0;
+while (my $line = <$fh>)
+{
+	chomp($line);
+	if ($line eq "primary_slot_name = 'decoding_standby'")
+	{
+		$found = 1;
+		last;
+	}
+}
+ok($found, "using physical slot for standby");
+
+sub print_phys_xmin
+{
+	my $slot = $node_master->slot('decoding_standby');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+
+# If no slot on standby exists to hold down catalog_xmin it must follow xmin,
+# (which is nextXid when no xacts are running on the standby).
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+is($xmin, $catalog_xmin, "xmin and catalog_xmin equal");
+
+# We need catalog_xmin advance to take effect on the master and be replayed
+# on standby.
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+   0, 'logical slot creation on standby succeeded')
+	or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+	my $slot = $node_replica->slot('standby_logical');
+	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+	or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+for my $i (0 .. 2000)
+{
+    $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+	# First burn some xids on the master in another DB, so we push the master's
+	# nextXid ahead.
+	foreach my $i (1 .. 100)
+	{
+		$node_master->safe_psql('postgres', 'SELECT txid_current()');
+	}
+
+	# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+	# past our needed xmin. The only way we have visibility into that is to force
+	# a checkpoint.
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+	foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+	{
+		$node_master->safe_psql($dbname, 'VACUUM FREEZE');
+	}
+	sleep(1);
+	$node_master->safe_psql('postgres', 'CHECKPOINT');
+	IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+		or die "pg_controldata failed with $?";
+	my @checkpoint = split('\n', $stdout);
+	my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+	foreach my $line (@checkpoint)
+	{
+		if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+		{
+			$nextXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+		{
+			$oldestXid = $1;
+		}
+		if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+		{
+			$oldestCatalogXmin = $1;
+		}
+	}
+	die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+	my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+	my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+	print "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+	$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+	return ($oldestXid, $oldestCatalogXmin);
+}
+
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+cmp_ok($oldestCatalogXmin, ">=", $oldestXid, "oldestCatalogXmin >= oldestXid");
+cmp_ok($oldestCatalogXmin, "<=", $new_logical_catalog_xmin, "oldestCatalogXmin <= downstream catalog_xmin");
+
+#########################################################
+# Conflict with recovery: xmin cancels decoding session
+#########################################################
+#
+# Start a transaction on the replica then perform work that should cause a
+# recovery conflict with it. We'll check to make sure the client gets
+# terminated with recovery conflict.
+#
+# Temporarily disable hs feedback so we can test recovery conflicts.
+# It's fine to continue using a physical slot, the xmin should be
+# cleared. We only check hot_standby_feedback when establishing
+# a new decoding session so this approach circumvents the safeguards
+# in place and forces a conflict.
+#
+# We'll also create an unrelated table so we can drop it later, making
+# sure there are catalog changes to replay.
+$node_master->safe_psql('testdb', 'CREATE TABLE dummy_table(blah integer)');
+
+# Start pg_recvlogical before we turn off hs_feedback so its slot's
+# catalog_xmin is above the downstream's catalog_threshold when we start
+# decoding.
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off');
+$node_replica->reload;
+
+sleep(2);
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "physical xmin null after hs_feedback disabled");
+is($catalog_xmin, '', "physical catalog_xmin null after hs_feedback disabled");
+
+# Burn a bunch of XIDs and make sure upstream catalog_xmin is past what we'll
+# need here
+($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+cmp_ok($oldestXid, ">", $new_logical_catalog_xmin, 'upstream oldestXid advanced past downstream catalog_xmin with hs_feedback off');
+cmp_ok($oldestCatalogXmin, "==", 0, "oldestCatalogXmin = InvalidTransactionId with hs_feedback off");
+
+# Make some data-only changes. We don't have a way to delay advance of the
+# catalog_xmin threshold until catalog changes are made, so now that our slot is
+# no longer holding down catalog_xmin, this will result in a recovery conflict.
+$node_master->safe_psql('testdb', 'DELETE FROM test_table');
+# Force a checkpoint to make sure catalog_xmin advances
+$node_master->safe_psql('testdb', 'CHECKPOINT;');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+$handle->pump;
+
+is($node_replica->slot('standby_logical')->{'active_pid'}, '', 'pg_recvlogical no longer connected to slot');
+
+# client died?
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server on recovery conflict");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict errmsg');
+	like($stderr, qr/requires catalog rows that will be removed/, 'pg_recvlogical exited with catalog_xmin conflict');
+}
+else
+{
+	fail("pg_recvlogical returned ok $return with stdout '$stdout', stderr '$stderr'");
+}
+
+# record the xmin when the conflicts arose
+my ($conflict_xmin, $conflict_catalog_xmin) = print_logical_xmin();
+
+#####################################################################
+# Conflict with recovery: oldestCatalogXmin should be zero with no feedback
+#####################################################################
+#
+# We cleared the catalog_xmin on the physical slot when hs feedback was turned
+# off. There's no logical slot on the master. So oldestCatalogXmin must be
+# zero.
+#
+$node_replica->safe_psql('postgres', 'CHECKPOINT');
+command_like(['pg_controldata', $node_replica->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:0$/m,
+	"pg_controldata's oldestCatalogXmin is zero when hot standby feedback is off");
+
+#####################################################################
+# Conflict with recovery: refuse to run without hot_standby_feedback
+#####################################################################
+#
+# When hot_standby_feedback is off, new connections should fail.
+#
+
+IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+is($?, 256, 'pg_recvlogical failed to connect to slot while hot_standby_feedback off');
+like($stderr, qr/hot_standby_feedback/, 'recvlogical recovery conflict errmsg');
+
+#####################################################################
+# Conflict with recovery: catalog_xmin advance invalidates idle slot
+#####################################################################
+#
+# The slot that pg_recvlogical was using before it was terminated
+# should not accept new connections now, since its catalog_xmin
+# is lower than the replica's threshold. Even once we re-enable
+# hot_standby_feedback, the removed tuples won't somehow come back.
+#
+
+$node_replica->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on');
+$node_replica->reload;
+# Wait until hot_standby_feedback is applied
+sleep(2);
+# make sure we see the effect promptly in xlog
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2);
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, 'xmin on phys slot non-null after re-establishing hot standby feedback');
+ok($catalog_xmin, 'catalog_xmin on phys slot non-null after re-establishing hot standby feedback')
+	or BAIL_OUT('further results meaningless if catalog_xmin not set on master');
+
+# The walsender will clamp the catalog_xmin on the slot, so when the standby sends
+# feedback with a too-old catalog_xmin the result will actually be limited to
+# the safe catalog_xmin.
+cmp_ok($catalog_xmin, ">=", $conflict_catalog_xmin,
+	'phys slot catalog_xmin has not rewound to replica logical slot catalog_xmin');
+
+print "catalog_xmin is $catalog_xmin";
+
+$node_replica->safe_psql('postgres', 'CHECKPOINT');
+command_like(['pg_controldata', $node_replica->data_dir], qr/^Latest checkpoint's oldestCatalogXmin:(?!$conflict_catalog_xmin)[^0][[:digit:]]*$/m,
+	"pg_controldata's oldestCatalogXmin has not rewound to slot catalog_xmin")
+	or BAIL_OUT('oldestCatalogXmin rewound, further tests are nonsensical');
+
+my $timer = IPC::Run::timeout(120);
+eval {
+	IPC::Run::run(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'standby_logical', '-f', '-', '--no-loop', '--start'],
+		'>', \$stdout, '2>', \$stderr, $timer);
+};
+ok(!$timer->is_expired, 'pg_recvlogical exited without timing out');
+is($?, 256, 'pg_recvlogical failed to connect to slot with past catalog_xmin');
+like($stderr, qr/replication slot '.*' requires catalogs removed by master/, 'recvlogical recovery conflict errmsg');
+
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_xmin, $new_catalog_xmin) = print_phys_xmin();
+# We're now back to the old behaviour of hot_standby_feedback
+# reporting nextXid for both thresholds
+ok($new_catalog_xmin, "physical catalog_xmin still non-null");
+cmp_ok($new_catalog_xmin, '==', $new_xmin,
+	'xmin and catalog_xmin equal after slot drop');
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot')
+	or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot')
+	or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+note "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+note "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+	'pg_recvlogical created dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+note "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+  or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+	sleep(1);
+	note "waiting for walsender to exit";
+}
+
+note "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+	$handle->finish;
+};
+$return = $?;
+if ($return) {
+	is($return, 256, "pg_recvlogical terminated by server");
+	like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+	like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+  'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
-- 
2.5.5

#99Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#98)
Re: Logical decoding on standby

On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:

On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

Totally agreed.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#100Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#99)
Re: Logical decoding on standby

On Wed, Apr 5, 2017 at 10:32 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:

On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

Totally agreed.

Based on this exchange, marked as "Moved to next CF".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#101Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#100)
Re: Logical decoding on standby

On 5 April 2017 at 23:25, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 5, 2017 at 10:32 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:

On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

Totally agreed.

Based on this exchange, marked as "Moved to next CF".

Yeah. Can't say I like it, but I have to agree.

We can get this rolling in early pg11, and that way we can hopefully get
support for it into logical replication too.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#102sanyam jain
sanyamjain22@live.in
In reply to: Robert Haas (#100)
Re: Logical decoding on standby

Hi,

In this patch, walsender.c sets sendTimeLineIsHistoric to true when state->currTLI and ThisTimeLineID are equal:

sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;

Shouldn't sendTimeLineIsHistoric be true when state->currTLI is less than ThisTimeLineID?

When I applied the timeline following patch alone, pg_recvlogical quit in the startup phase. With the above change pg_recvlogical works, although timeline following still doesn't work.
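
To make the two conditions concrete, here is a minimal standalone sketch. It is not the walsender.c code: TimeLineID is modelled as a plain uint32_t and both function names are hypothetical. Only the two comparisons are taken from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's TimeLineID; an assumption of this sketch. */
typedef uint32_t TimeLineID;

/* The comparison as posted in the patch: true when the timelines match. */
bool
historic_as_posted(TimeLineID curr_tli, TimeLineID this_tli)
{
	return curr_tli == this_tli;
}

/* The comparison proposed above: true only for an older timeline. */
bool
historic_proposed(TimeLineID curr_tli, TimeLineID this_tli)
{
	/* 0 is treated as "timeline not set" in this sketch. */
	assert(this_tli != 0);
	return curr_tli < this_tli;
}
```

With the posted comparison, reading from the current timeline is reported as historic and reading from an older one is not, which is the inverse of what the variable name suggests.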

Thanks,

Sanyam Jain


#103sanyam jain
sanyamjain22@live.in
In reply to: sanyam jain (#102)
Re: Logical decoding on standby

Hi,
After changing
sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
to
sendTimeLineIsHistoric = state->currTLI != ThisTimeLineID;

I then ran into another issue.
On promotion of a cascaded server, ThisTimeLineID in the standby server that has the logical slot becomes 0.
I then added a call to GetStandbyFlushRecPtr in StartLogicalReplication, which updates ThisTimeLineID.

After the above two changes timeline following works, but I'm not sure whether this is correct. Could someone please clarify?
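
The ordering issue described above can be modelled in a few lines. This is only a sketch under stated assumptions: this_timeline_id stands in for ThisTimeLineID, and refresh_this_timeline is a hypothetical stand-in for the effect attributed to GetStandbyFlushRecPtr in this message (updating ThisTimeLineID). None of it is the actual walsender code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's TimeLineID; an assumption of this sketch. */
typedef uint32_t TimeLineID;

/* Models the process-local ThisTimeLineID; 0 means "not set". */
static TimeLineID this_timeline_id = 0;

/*
 * Hypothetical stand-in for the side effect described above:
 * record the timeline the standby is currently on.
 */
void
refresh_this_timeline(TimeLineID replay_tli)
{
	assert(replay_tli != 0);
	this_timeline_id = replay_tli;
}

/* The changed comparison from the message above. */
bool
send_timeline_is_historic(TimeLineID curr_tli)
{
	return curr_tli != this_timeline_id;
}
```

If the comparison runs while this_timeline_id is still 0, every timeline looks historic. Refreshing first gives the expected answer for the current timeline.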

Thanks,
Sanyam Jain

#104sanyam jain
sanyamjain22@live.in
In reply to: sanyam jain (#103)
Re: Logical decoding on standby

Hi,

After changing
sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
to
sendTimeLineIsHistoric = state->currTLI != ThisTimeLineID;

I was facing another issue.
On promotion of a cascaded server ThisTimeLineID in the standby server having logical slot becomes 0.
Then i added a function call to GetStandbyFlushRecPtr in StartLogicalReplication which updates ThisTimeLineID.

After the above two changes timeline following is working. But i'm not sure whether this is correct or not. In any case please someone clarify.

Could anyone with experience explain whether the steps I have taken are correct?

Thanks,
Sanyam Jain

#105Craig Ringer
craig@2ndquadrant.com
In reply to: sanyam jain (#102)
Re: Logical decoding on standby

On 21 June 2017 at 13:28, sanyam jain <sanyamjain22@live.in> wrote:

Hi,

In this patch in walsender.c sendTimeLineIsHistoric is set to true when
current and ThisTimeLineID are equal.

sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;

Shouldn't sendTimeLineIsHistoric is true when state->currTLI is less than
ThisTimeLineID.

Correct, that was a bug. I thought it got fixed upthread though.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#106Craig Ringer
craig@2ndquadrant.com
In reply to: sanyam jain (#103)
Re: Logical decoding on standby

On 21 June 2017 at 17:30, sanyam jain <sanyamjain22@live.in> wrote:

Hi,
After changing
sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
to
sendTimeLineIsHistoric = state->currTLI != ThisTimeLineID;

I was facing another issue.
On promotion of a cascaded server ThisTimeLineID in the standby server
having logical slot becomes 0.
Then i added a function call to GetStandbyFlushRecPtr in
StartLogicalReplication which updates ThisTimeLineID.

After the above two changes timeline following is working.But i'm not sure
whether this is correct or not.In any case please someone clarify.

That's a reasonable thing to do, and again, I thought I did it in a
later revision, but apparently not (?). I've been working on other
things and have lost track of progress here a bit.

I'll check more closely.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#107Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#106)
Re: [HACKERS] Logical decoding on standby

On 27 June 2017 at 13:24, Craig Ringer <craig@2ndquadrant.com> wrote:

On 21 June 2017 at 17:30, sanyam jain <sanyamjain22@live.in> wrote:

Hi,
After changing
sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
to
sendTimeLineIsHistoric = state->currTLI != ThisTimeLineID;

I was facing another issue.
On promotion of a cascaded server ThisTimeLineID in the standby server
having logical slot becomes 0.
Then i added a function call to GetStandbyFlushRecPtr in
StartLogicalReplication which updates ThisTimeLineID.

After the above two changes timeline following is working. But i'm not sure
whether this is correct or not. In any case please someone clarify.

That's a reasonable thing to do, and again, I thought I did it in a
later revision, but apparently not (?). I've been working on other
things and have lost track of progress here a bit.

I'll check more closely.

Hi all.

I've had to backburner this due to other work. In the process of looking
into an unrelated bug recently though, I noticed that the way we handle
snapshots may not be safe for historic snapshots on a standby. Historic
snapshots don't ever set takenDuringRecovery, which allows heapgetpage to
trust PD_ALL_VISIBLE on a page. According to comments on heapgetpage that
could be an issue.

Minor compared to some of the other things that'll come up when finishing
this off, but worth remembering.
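
For readers unfamiliar with that check, the following is a reduced model of the decision in question, not the heapam.c source. The struct and function names are hypothetical, and each struct keeps only the one field that matters here. It shows why a snapshot that never sets takenDuringRecovery ends up trusting the page-level all-visible flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for a snapshot; the real struct has many more fields. */
typedef struct SnapshotSketch
{
	bool		takenDuringRecovery;
} SnapshotSketch;

/* Reduced stand-in for a heap page header. */
typedef struct PageSketch
{
	bool		all_visible;	/* models the PD_ALL_VISIBLE page flag */
} PageSketch;

/*
 * Returns true when a scan may skip per-tuple visibility checks.
 * The page flag is trusted only for snapshots not taken during recovery.
 */
bool
can_trust_all_visible(const PageSketch *page, const SnapshotSketch *snap)
{
	assert(page && snap);
	return page->all_visible && !snap->takenDuringRecovery;
}
```

A historic snapshot built on a standby with takenDuringRecovery left false takes the first branch of that condition, so the shortcut applies even though the snapshot was in effect created during recovery.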

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services