snapbuild woes

Started by Petr Jelinek about 9 years ago, 92 messages
#1 Petr Jelinek
petr.jelinek@2ndquadrant.com
2 attachment(s)

Hi,

I recently found a couple of issues with the way the initial logical decoding
snapshot is made.

The first one is an outright bug, which has to do with how we track running
transactions. What snapbuild basically does while building the initial snapshot
is read the xl_running_xacts record, store the list of running transactions, and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC), so there might be xids
in xl_running_xacts that had already committed before it was logged. This in
turn means that snapbuild will never reach a consistent snapshot when that
situation occurs. The reason this hasn't bitten us much yet is that
snapbuild considers all transactions finished once an empty
xl_running_xacts arrives, which usually happens after a while (but might
not on a consistently busy server).

The fix I came up with is to extend the mechanism to also consider
all tracked transactions finished when an xl_running_xacts record arrives
whose xmin is bigger than the previously tracked xmax. This seems to work
pretty well on busy servers. The fix is attached as
0001-Mark-snapshot-consistent-when-all-running-txes-have.patch. I
believe it should be backpatched all the way to 9.4.
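
To illustrate the idea, here is a minimal sketch of the condition the attached
0001 patch adds, with plain structs standing in for builder->running and the
decoded xl_running_xacts record (like the patch's direct comparison, it ignores
xid wraparound):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the tracked set and the decoded record. */
typedef struct
{
	uint32_t	xcnt;	/* number of initially running xacts we still track */
	uint32_t	xmax;	/* xmax of that tracked set */
} TrackedXacts;

typedef struct
{
	uint32_t	xcnt;				/* running xacts listed in the record */
	uint32_t	oldestRunningXid;	/* oldest xid running when it was logged */
} RunningXactsRec;

/*
 * We may jump to CONSISTENT if either nothing was running when the record
 * was logged, or everything we have been tracking had finished by the time
 * the record was logged (its oldest running xid is past our tracked xmax).
 */
static bool
can_mark_consistent(const TrackedXacts *tracked, const RunningXactsRec *rec)
{
	return rec->xcnt == 0 ||
		(tracked->xcnt > 0 && rec->oldestRunningXid > tracked->xmax);
}

int
main(void)
{
	TrackedXacts	tracked = {.xcnt = 3, .xmax = 1005};
	RunningXactsRec	rec = {.xcnt = 2, .oldestRunningXid = 1010};

	/* All tracked xacts are older than anything still running: prints 1 */
	printf("consistent: %d\n", can_mark_consistent(&tracked, &rec));
	return 0;
}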

The other issue is a performance problem, again with the initial snapshot on
busy servers. We track transactions that modify the catalog so
that we can do proper catalog time travel when decoding changes. But
for transactions that were already running when we started trying to get the
initial consistent snapshot, there is no good way to tell whether they made
catalog changes or not, so we treat them all as catalog-changing and
build a separate historical snapshot for every such transaction. This by
itself is fine; the problem is that the current implementation also
treats transactions that started after we began watching for changes,
but before we reached a consistent state, as catalog-changing,
even though for those we actually do know whether they made any catalog
change. In practice this means we build snapshots that are not
really necessary, and if there was a long-running transaction that the
snapshot builder had to wait for, we can create thousands of unused
snapshots, which hurts performance badly (I've seen the initial
snapshot take an hour because of this).

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't build snapshots for transactions that we have seen
in full and know made no catalog changes, even if we haven't
reached consistency yet.
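
A minimal sketch of the decision this patch introduces (hypothetical flattened
signature; the real code works on the SnapBuild state inside SnapBuildCommitTxn):

#include <stdbool.h>

/*
 * Decide whether a committing transaction needs a freshly built historic
 * snapshot.  "seen_whole_xact" means the transaction started after decoding
 * began, so we know for sure whether it touched the catalog.
 */
static bool
need_new_snapshot(bool have_snapshot,			/* builder already has one */
				  bool made_catalog_changes,	/* toplevel or any subxact */
				  bool seen_whole_xact)
{
	/* Without an existing snapshot we must always build one. */
	if (!have_snapshot)
		return true;

	/* Catalog changes always require a new snapshot for time travel. */
	if (made_catalog_changes)
		return true;

	/*
	 * Transactions that were already running when decoding started are
	 * forced to "time travel" because we cannot tell whether they changed
	 * the catalog.  A transaction seen in full that made no catalog changes
	 * can skip the expensive snapshot build - the case the patch adds.
	 */
	return !seen_whole_xact;
}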

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Mark-snapshot-consistent-when-all-running-txes-have.patch (application/x-patch)
From 67e902eef33da9e241c079eaed9ab12a88616296 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sat, 10 Dec 2016 21:56:44 +0100
Subject: [PATCH 1/2] Mark snapshot consistent when all running txes have
 finished.

---
 src/backend/replication/logical/snapbuild.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 8b59fc5..5836d52 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1213,9 +1213,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * Build catalog decoding snapshot incrementally using information about
 	 * the currently running transactions. There are several ways to do that:
 	 *
-	 * a) There were no running transactions when the xl_running_xacts record
-	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 * a) Either there were no running transactions when the xl_running_xacts
+	 *    record was inserted (we might find this while waiting for b) or c))
+	 *    or the running transactions we've been tracking have all finished
+	 *    by the time the xl_running_xacts was inserted. We can jump to
+	 *    CONSISTENT immediately.
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1251,13 +1253,18 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	}
 
 	/*
-	 * a) No transaction were running, we can jump to consistent.
+	 * a) Either no transaction were running or we've been tracking running
+	 *    transactions and the new snapshot has them all finished, we can
+	 *    jump to consistent.
 	 *
-	 * NB: We might have already started to incrementally assemble a snapshot,
-	 * so we need to be careful to deal with that.
+	 * NB: Since we might have already started to incrementally assemble a
+	 * snapshot, we need to be careful to deal with that.
 	 */
-	if (running->xcnt == 0)
+	if (running->xcnt == 0 ||
+		(builder->running.xcnt > 0 &&
+		 running->oldestRunningXid > builder->running.xmax))
 	{
+
 		if (builder->start_decoding_at == InvalidXLogRecPtr ||
 			builder->start_decoding_at <= lsn)
 			/* can decode everything after this */
@@ -1278,10 +1285,16 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		builder->state = SNAPBUILD_CONSISTENT;
 
+		/*
+		 * Give different log detail based on actual state for easier
+		 * debugging.
+		 */
 		ereport(LOG,
 				(errmsg("logical decoding found consistent point at %X/%X",
 						(uint32) (lsn >> 32), (uint32) lsn),
-				 errdetail("There are no running transactions.")));
+				 errdetail(running->xcnt == 0 ?
+						   "There are no running transactions." :
+						   "All running transactions have finished.")));
 
 		return false;
 	}
-- 
2.7.4

0002-Skip-unnecessary-snapshot-builds.patch (application/x-patch)
From 31ea8f255e531c9327f7a5d55c5c0c756b37739c Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sat, 10 Dec 2016 22:22:13 +0100
Subject: [PATCH 2/2] Skip unnecessary snapshot builds

---
 src/backend/replication/logical/snapbuild.c | 38 +++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5836d52..3cf4829 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -955,6 +955,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		build_snapshot = true;
 
 	TransactionId xmax = xid;
 
@@ -976,10 +977,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			build_snapshot = false;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -1069,15 +1079,25 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
 			return;
 
+		/* Make sure we always build snapshot if there is no existing one. */
+		build_snapshot = build_snapshot || !builder->snapshot;
+
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
 		 */
-		if (builder->snapshot)
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1087,11 +1107,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#2 Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#1)
1 attachment(s)
Re: snapbuild woes

On 10/12/16 23:10, Petr Jelinek wrote:

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't make snapshots for transactions that we seen
wholly and know that they didn't make catalog changes even if we didn't
reach consistency yet.

Eh, attached wrong patch. This one is correct.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0002-Skip-unnecessary-snapshot-builds.patch (application/x-patch)
From 2add068ed38c2887e6652396c280695a7e384fe7 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sat, 10 Dec 2016 22:22:13 +0100
Subject: [PATCH 2/2] Skip unnecessary snapshot builds

---
 src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5836d52..ea3f40f 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -955,6 +955,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -976,10 +977,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -992,21 +1002,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1018,6 +1017,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1026,14 +1035,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1046,10 +1049,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1070,14 +1081,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * already built, or if the transaction has made catalog changes or
+		 * when we can't know if transaction made catalog changes.
 		 */
-		if (builder->snapshot)
+		build_snapshot = !builder->snapshot || top_needs_timetravel ||
+			sub_needs_timetravel || !skip_forced_snapshot;
+
+		/*
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
+		 */
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1087,11 +1113,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#3 Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Petr Jelinek (#2)
Re: snapbuild woes

On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:

On 10/12/16 23:10, Petr Jelinek wrote:

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't make snapshots for transactions that we seen
wholly and know that they didn't make catalog changes even if we didn't
reach consistency yet.

Eh, attached wrong patch. This one is correct.

Attached no patch second time?

#4 Kevin Grittner
kgrittn@gmail.com
In reply to: Craig Ringer (#3)
Re: snapbuild woes

On Sun, Dec 11, 2016 at 1:17 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:

On 10/12/16 23:10, Petr Jelinek wrote:

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't make snapshots for transactions that we seen
wholly and know that they didn't make catalog changes even if we didn't
reach consistency yet.

Eh, attached wrong patch. This one is correct.

Attached no patch second time?

I see an attachment, and it shows in the archives.

/messages/by-id/aee1d499-e3ca-e091-56da-1ee6a47741c8@2ndquadrant.com

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Kevin Grittner (#4)
Re: snapbuild woes

On 12 December 2016 at 00:36, Kevin Grittner <kgrittn@gmail.com> wrote:

On Sun, Dec 11, 2016 at 1:17 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:

On 10/12/16 23:10, Petr Jelinek wrote:

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't make snapshots for transactions that we seen
wholly and know that they didn't make catalog changes even if we didn't
reach consistency yet.

Eh, attached wrong patch. This one is correct.

Attached no patch second time?

I see an attachment, and it shows in the archives.

/messages/by-id/aee1d499-e3ca-e091-56da-1ee6a47741c8@2ndquadrant.com

Sorry for the noise, apparently my phone's mail client was being dense.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6 Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#1)
Re: snapbuild woes

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Regards,

Andres


#7 Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#6)
Re: snapbuild woes

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts record that listed the xid as running. I've only seen it on
a production system though; I didn't manage to reproduce it easily
locally.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#7)
Re: snapbuild woes

On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I only seen it on
production system though, didn't really manage to easily reproduce it
locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does. So they're still running in the PGPROC
sense, just not the crash-recovery sense...

Andres


#9 Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: snapbuild woes

On 12/12/16 23:33, Andres Freund wrote:

On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I only seen it on
production system though, didn't really manage to easily reproduce it
locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does. So they're still running in the PGPROC
sense, just not the crash-recovery sense...

That looks like a reasonable explanation. BTW, I realized my patch needs a
bit more work: currently it will break the actual snapshot, as it behaves the
same as if the xl_running_xacts record were empty, which is not correct AFAICS.

Also, if we took the approach suggested by my patch (i.e. using this
xmin/xmax comparison), I guess we wouldn't need to hold the lock for the
extra time at wal_level = logical anymore.

That is of course unless you think it should be approached from the
other side of the stream, and we should try to log a correct xl_running_xacts instead.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#10 Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#9)
3 attachment(s)
Re: snapbuild woes

On 13/12/16 00:38, Petr Jelinek wrote:

On 12/12/16 23:33, Andres Freund wrote:

On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I only seen it on
production system though, didn't really manage to easily reproduce it
locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does. So they're still running in the PGPROC
sense, just not the crash-recovery sense...

That looks like reasonable explanation. BTW I realized my patch needs
bit more work, currently it will break the actual snapshot as it behaves
same as if the xl_running_xacts was empty which is not correct AFAICS.

Hi,

I got to work on this again. Unfortunately I haven't found a solution that
I am very happy with. What I did is: in case we read an
xl_running_xacts record in which all the transactions we track have finished,
we start tracking from that new xl_running_xacts again, with the difference
that we clean up the running transactions based on the previously seen
committed ones. That means that on a busy server we may wait for multiple
xl_running_xacts records rather than just one, but at least we have a chance
to finish, unlike with the current coding, which basically waits for an empty
xl_running_xacts. I also removed the additional locking for logical
wal_level in LogStandbySnapshot() since it does not work.
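
Roughly, the restart-and-purge step in the attached 0003 patch boils down to the
following sketch (hypothetical flat arrays instead of the SnapBuild bookkeeping;
the real patch works on builder->running and builder->committed):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Start tracking from a new xl_running_xacts record: take its xid array as
 * the new set of transactions to wait for, but drop any xid we have already
 * seen commit - those only appear as "running" because of the race in
 * LogStandbySnapshot().  Returns the number of xids actually tracked.
 */
static size_t
restart_tracking(uint32_t *tracked,
				 const uint32_t *running_xip, size_t running_xcnt,
				 const uint32_t *committed_xip, size_t committed_xcnt)
{
	size_t		ntracked = 0;

	for (size_t i = 0; i < running_xcnt; i++)
	{
		bool		already_committed = false;

		for (size_t j = 0; j < committed_xcnt; j++)
		{
			if (committed_xip[j] == running_xip[i])
			{
				already_committed = true;
				break;
			}
		}

		if (!already_committed)
			tracked[ntracked++] = running_xip[i];
	}

	return ntracked;
}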

I also identified another bug in snapbuild while looking at the code:
logical decoding will try to use an on-disk serialized snapshot
for the initial snapshot export when it can. The problem is that these
snapshots are quite special and are not really usable as snapshots for
data (i.e., the logical decoding snapshots regularly have xmax smaller
than xmin). So when a client tries to use such an exported snapshot, it
gets completely wrong data, as the snapshot is broken. I think this is
the explanation for Erik Rijkers's problems with the initial COPY patch for
logical replication; at least for me the issues go away when I disable
use of the on-disk snapshots.

I didn't really find a better solution than that, though (disabling the use
of on-disk snapshots for the initial consistent snapshot).

So to summarize the attached patches:
0001 - Fixes a performance issue where we build tons of snapshots that we
don't need, which kills CPU.

0002 - Disables the use of on-disk historical snapshots for the initial
consistent snapshot export, as it may result in corrupt data. This
definitely needs a backport.

0003 - Fixes a bug where we might never reach a consistent snapshot on a busy
server due to a race condition in xl_running_xacts logging. The original use
of extra locking does not seem to be enough in practice. Once we have an
agreed fix for this it's probably worth backpatching. There are still some
comments that need updating; this is more of a PoC.

Thoughts or better ideas?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Skip-unnecessary-snapshot-builds.patch (text/x-patch)
From e8d4dc52bc9dc7e5b17f4e374319fda229b19e61 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 1/3] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
 src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed9f69f..c2476a9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -972,6 +972,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -993,10 +994,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -1009,21 +1019,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1035,6 +1034,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1043,14 +1052,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1063,10 +1066,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1087,14 +1098,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * already built, or if the transaction has made catalog changes or
+		 * when we can't know if transaction made catalog changes.
 		 */
-		if (builder->snapshot)
+		build_snapshot = !builder->snapshot || top_needs_timetravel ||
+			sub_needs_timetravel || !skip_forced_snapshot;
+
+		/*
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
+		 */
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1104,11 +1130,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch (text/x-patch)
From dcedfdafce82ae65656e32468d84b084e629a70b Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 20:14:44 +0100
Subject: [PATCH 2/3] Don't use on disk snapshots for snapshot export in
 logical decoding

We store historical snapshots on disk to enable continuation of logical
decoding after restart. These snapshots were reused by the
slot initialization code when searching for consistent snapshot. However
these snapshots are only useful for catalogs and not for normal user
tables. So when we exported such snapshots for user to read data from
tables that is consistent with a specific LSN of slot creation, user
would instead read wrong data. There does not seem to be simple way to
make the logical decoding historical snapshots useful for normal tables
so don't use them for exporting at all for now.
---
 src/backend/replication/logical/snapbuild.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c2476a9..0b10044 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1252,11 +1252,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 {
 	/* ---
 	 * Build catalog decoding snapshot incrementally using information about
-	 * the currently running transactions. There are several ways to do that:
+	 * the currently running transactions. There are couple ways to do that:
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 *	  state we were waiting for b).
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1269,9 +1269,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  Interestingly, in contrast to HS, this allows us not to care about
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
-	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.
 	 * ---
 	 */
 
@@ -1326,13 +1323,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
-	{
-		/* there won't be any state to cleanup */
-		return false;
-	}
-
 	/*
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
-- 
2.7.4

0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch (text/x-patch)
From 501e6c9842e225a90c52b8bf3f91adba03d38390 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 3/3] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for committs. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.
---
 src/backend/replication/logical/snapbuild.c | 79 +++++++++++++++++++++++------
 src/backend/storage/ipc/standby.c           |  7 +--
 2 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0b10044..e5f9a33 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1262,7 +1262,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().
 	 *	  NB: We need to search running.xip when seeing a transaction's end to
 	 *	  make sure it's a toplevel transaction and it's been one of the
 	 *	  initially running ones.
@@ -1327,11 +1332,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.
 	 */
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)
 	{
 		int			off;
+		int			purge_running = builder->running.xcnt > 0;
 
 		/*
 		 * We only care about toplevel xids as those are the ones we
@@ -1367,26 +1378,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->running.xmin = builder->running.xip[0];
 		builder->running.xmax = builder->running.xip[running->xcnt - 1];
 
+
 		/* makes comparisons cheaper later */
 		TransactionIdRetreat(builder->running.xmin);
 		TransactionIdAdvance(builder->running.xmax);
 
 		builder->state = SNAPBUILD_FULL_SNAPSHOT;
 
-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
 		/*
-		 * Iterate through all xids, wait for them to finish.
-		 *
-		 * This isn't required for the correctness of decoding, but to allow
-		 * isolationtester to notice that we're currently waiting for
-		 * something.
+		 * Iterate through all xids and do additional checking/purging.
 		 */
 		for (off = 0; off < builder->running.xcnt; off++)
 		{
@@ -1400,9 +1400,56 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			if (TransactionIdIsCurrentTransactionId(xid))
 				elog(ERROR, "waiting for ourselves");
 
+			/*
+			 * Use gathered info about committed transactions to purge
+			 * committed transactions recorded xl_running_xacts as running
+			 * because of race condition in LogStandbySnapshot(). This may
+			 * be slow but it should be called at most once per slot
+			 * initialization.
+			 */
+			if (purge_running)
+			{
+				int i;
+
+				for (i = 0; i < builder->committed.xcnt; i++)
+				{
+					if (builder->committed.xip[i] == xid)
+					{
+						SnapBuildEndTxn(builder, lsn, xid);
+						continue;
+					}
+				}
+			}
+
+			/*
+			 * This isn't required for the correctness of decoding, but to allow
+			 * isolationtester to notice that we're currently waiting for
+			 * something.
+			 */
 			XactLockTableWait(xid, NULL, NULL, XLTW_None);
 		}
 
+		if (!purge_running)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+
 		/* nothing could have built up so far, so don't perform cleanup */
 		return false;
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..9b41a28 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -945,15 +945,10 @@ LogStandbySnapshot(void)
 	 * record. Fortunately this routine isn't executed frequently, and it's
 	 * only a shared lock.
 	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
+	LWLockRelease(ProcArrayLock);
 
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.7.4

#11 Erik Rijkers
er@xs4all.nl
In reply to: Petr Jelinek (#10)
Re: snapbuild woes

On 2017-02-22 03:05, Petr Jelinek wrote:

So to summarize attached patches:
0001 - Fixes performance issue where we build tons of snapshots that we
don't need which kills CPU.

0002 - Disables the use of ondisk historical snapshots for initial
consistent snapshot export as it may result in corrupt data. This
definitely needs backport.

0003 - Fixes bug where we might never reach snapshot on busy server due
to race condition in xl_running_xacts logging. The original use of
extra
locking does not seem to be enough in practice. Once we have agreed fix
for this it's probably worth backpatching. There are still some
comments
that need updating, this is more of a PoC.

I am not entirely sure what to expect. Should a server with these 3
patches do the initial data copy or not? The sgml seems to imply there is
no initial data copy, but my test does copy something.

Anyway, I have repeated the same old pgbench test, assuming the initial data
copy should be working.

With

0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch

the consistent (but wrong) end state is always that only one of the four
pgbench tables, pgbench_history, is replicated (always correctly).

Below is the output from the test (I've edited the lines for email)
(below, a,b,t,h stand for: pgbench_accounts, pgbench_branches,
pgbench_tellers, pgbench_history)
(master on port 6972, replica on port 6973.)

port
6972 a,b,t,h: 100000 1 10 347
6973 a,b,t,h: 0 0 0 347

a,b,t,h: a68efc81a 2c27f7ba5 128590a57 1e4070879 master
a,b,t,h: d41d8cd98 d41d8cd98 d41d8cd98 1e4070879 replica NOK

The md5 strings shown (truncated to their initial characters) are from an md5
of the whole content of each table (an ordered select *).

I repeated this a few times: of course, the number of rows in
pgbench_history varies a bit but otherwise it is always the same: 3
empty replica tables, pgbench_history replicated correctly.

Something is not right.

thanks,

Erik Rijkers


#12 Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#11)
Re: snapbuild woes

On 22/02/17 11:29, Erik Rijkers wrote:

On 2017-02-22 03:05, Petr Jelinek wrote:

So to summarize attached patches:
0001 - Fixes performance issue where we build tons of snapshots that we
don't need which kills CPU.

0002 - Disables the use of ondisk historical snapshots for initial
consistent snapshot export as it may result in corrupt data. This
definitely needs backport.

0003 - Fixes bug where we might never reach snapshot on busy server due
to race condition in xl_running_xacts logging. The original use of extra
locking does not seem to be enough in practice. Once we have agreed fix
for this it's probably worth backpatching. There are still some comments
that need updating, this is more of a PoC.

I am not not entirely sure what to expect. Should a server with these 3
patches do initial data copy or not? The sgml seems to imply there is
not inital data copy. But my test does copy something.

Not by itself (without the copy patch), those fixes are for snapshots.

With

0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch

the consistent (but wrong) end state is always that only one of the four
pgbench tables, pgbench_history, is replicated (always correctly).

Below is the output from the test (I've edited the lines for email)
(below, a,b,t,h stand for: pgbench_accounts, pgbench_branches,
pgbench_tellers, pgbench_history)
(master on port 6972, replica on port 6973.)

port
6972 a,b,t,h: 100000 1 10 347
6973 a,b,t,h: 0 0 0 347

a,b,t,h: a68efc81a 2c27f7ba5 128590a57 1e4070879 master
a,b,t,h: d41d8cd98 d41d8cd98 d41d8cd98 1e4070879 replica NOK

The md5-initstrings are from a md5 of the whole content of each table
(an ordered select *)

I repeated this a few times: of course, the number of rows in
pgbench_history varies a bit but otherwise it is always the same: 3
empty replica tables, pgbench_history replicated correctly.

That's actually correct behaviour without the initial copy patch: it
replicates changes, but since the 3 tables only get updates, there is
nothing to replicate as there is no data downstream. Inserts, however,
will of course work fine even without data downstream.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#10)
4 attachment(s)
Re: snapbuild woes

On 22/02/17 03:05, Petr Jelinek wrote:

On 13/12/16 00:38, Petr Jelinek wrote:

On 12/12/16 23:33, Andres Freund wrote:

On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I only seen it on
production system though, didn't really manage to easily reproduce it
locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does. So they're still running in the PGPROC
sense, just not the crash-recovery sense...

That looks like reasonable explanation. BTW I realized my patch needs
bit more work, currently it will break the actual snapshot as it behaves
same as if the xl_running_xacts was empty which is not correct AFAICS.

Hi,

I got to work on this again. Unfortunately I haven't found solution that
I would be very happy with. What I did is in case we read
xl_running_xacts which has all transactions we track finished, we start
tracking from that new xl_running_xacts again with the difference that
we clean up the running transactions based on previously seen committed
ones. That means that on busy server we may wait for multiple
xl_running_xacts rather than just one, but at least we have chance to
finish unlike with current coding which basically waits for empty
xl_running_xacts. I also removed the additional locking for logical
wal_level in LogStandbySnapshot() since it does not work.

Not hearing any opposition to this idea, I decided to polish it and
also optimize it a bit.

That being said, thanks to testing from Erik Rijkers I've identified one
more bug in how we do the initial snapshot. Apparently we don't reserve
the global xmin when we start building the initial exported snapshot for
a slot (we only reserve catalog_xmin, which is fine for logical decoding
but not for the exported snapshot), so VACUUM and heap pruning will
happily delete old versions of rows that are still needed by anybody
trying to use that exported snapshot.
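
The core of the fix in the attached 0001 patch is, in a much simplified form
(stand-in struct, no locking; the real patch also updates data.xmin and
recomputes the required xmin via ReplicationSlotsComputeRequiredXmin()):

#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Reduced view of a replication slot's xmin bookkeeping. */
typedef struct
{
	TransactionId	effective_xmin;			/* protects user-table tuples */
	TransactionId	effective_catalog_xmin;	/* protects catalog tuples */
} SlotXmins;

/*
 * While the snapshot that will be exported at slot creation is being built,
 * hold back VACUUM and pruning on user tables too, not just on catalogs.
 */
static void
reserve_xmin_for_export(SlotXmins *slot, TransactionId safe_xid)
{
	slot->effective_catalog_xmin = safe_xid;
	slot->effective_xmin = safe_xid;	/* the part this fix adds */
}

/* Once the exported snapshot has been handed out, release the data xmin. */
static void
release_export_xmin(SlotXmins *slot)
{
	slot->effective_xmin = InvalidTransactionId;
}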

Attached are updated versions of the patches:

0001 - Fixes the above-mentioned global xmin tracking issue. Needs to
be backported all the way to 9.4.

0002 - Removes use of the logical decoding saved snapshots for the initial
exported snapshot, since those snapshots only work for catalogs and not
user data. Also needs to be backported all the way to 9.4.

0003 - Changes handling of xl_running_xacts in the initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

0004 - Improves performance of the initial snapshot build by skipping
the catalog snapshot build for transactions that don't do catalog changes.

0001 and 0002 are bug fixes, because without them the exported
snapshots are basically corrupted. 0003 and 0004 are performance
improvements, but on busy servers the snapshot export might never happen,
so they address rather serious performance issues.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch (text/x-patch)
From 67a44702ff146756b33e8d15e91a02f5d9e86792 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/4] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
 src/backend/replication/logical/logical.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9062244 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
 	 * protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily so that the initial
+	 * snapshot can be exported. After initial snapshot is done global xmin
+	 * should be reset and not tracked anymore.
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	slot->effective_xmin = slot->effective_catalog_xmin;
+	slot->data.xmin = slot->effective_catalog_xmin;
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
@@ -282,7 +288,7 @@ CreateInitDecodingContext(char *plugin,
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
 	 */
-	xmin_horizon = slot->data.catalog_xmin;
+	xmin_horizon = slot->effective_xmin;
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
@@ -456,12 +462,29 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 void
 FreeDecodingContext(LogicalDecodingContext *ctx)
 {
+	ReplicationSlot *slot = MyReplicationSlot;
+
 	if (ctx->callbacks.shutdown_cb != NULL)
 		shutdown_cb_wrapper(ctx);
 
-	ReorderBufferFree(ctx->reorder);
-	FreeSnapshotBuilder(ctx->snapshot_builder);
-	XLogReaderFree(ctx->reader);
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext(). We do not take ProcArrayLock or similar
+	 * since we only reset xmin here and there's not much harm done by a
+	 * concurrent computation missing that.
+	 */
+	SpinLockAcquire(&slot->mutex);
+	slot->effective_xmin = InvalidTransactionId;
+	slot->data.xmin = InvalidTransactionId;
+	SpinLockRelease(&slot->mutex);
+	ReplicationSlotsComputeRequiredXmin(false);
+
+	if (ctx->reorder)
+		ReorderBufferFree(ctx->reorder);
+	if (ctx->snapshot_builder)
+		FreeSnapshotBuilder(ctx->snapshot_builder);
+	if (ctx->reader)
+		XLogReaderFree(ctx->reader);
 	MemoryContextDelete(ctx->context);
 }
 
-- 
2.7.4

snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchtext/x-patch; name=snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchDownload
From 10bfdabe76deaf49ae019321fe91720ce6d2ce71 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 20:14:44 +0100
Subject: [PATCH 2/4] Don't use on disk snapshots for snapshot export in
 logical decoding

We store historical snapshots on disk to enable continuation of logical
decoding after restart. These snapshots were reused by the
slot initialization code when searching for consistent snapshot. However
these snapshots are only useful for catalogs and not for normal user
tables. So when we exported such snapshots for user to read data from
tables that is consistent with a specific LSN of slot creation, user
would instead read wrong data. There does not seem to be simple way to
make the logical decoding historical snapshots useful for normal tables
so don't use them for exporting at all for now.
---
 src/backend/replication/logical/snapbuild.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c0f28dd..143e8ec 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1211,11 +1211,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 {
 	/* ---
 	 * Build catalog decoding snapshot incrementally using information about
-	 * the currently running transactions. There are several ways to do that:
+	 * the currently running transactions. There are couple ways to do that:
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 *	  state we were waiting for b).
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1228,9 +1228,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  Interestingly, in contrast to HS, this allows us not to care about
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
-	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.
 	 * ---
 	 */
 
@@ -1285,13 +1282,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
-	{
-		/* there won't be any state to cleanup */
-		return false;
-	}
-
 	/*
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
-- 
2.7.4

snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patchtext/x-patch; name=snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patchDownload
From 2f4d36456430e0974d9348f9c4ea5ece6656c544 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 3/4] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for commits. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.

This also reverts changes made to GetRunningTransactionData() and
LogStandbySnapshot() by b89e151 as the additional locking does not help.
---
 src/backend/replication/logical/snapbuild.c | 65 ++++++++++++++++++++++++-----
 src/backend/storage/ipc/procarray.c         |  5 ++-
 src/backend/storage/ipc/standby.c           | 19 ---------
 3 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 143e8ec..40937fb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().
 	 *	  NB: We need to search running.xip when seeing a transaction's end to
 	 *	  make sure it's a toplevel transaction and it's been one of the
 	 *	  initially running ones.
@@ -1286,9 +1291,14 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.
 	 */
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)
 	{
 		int			off;
 
@@ -1326,20 +1336,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->running.xmin = builder->running.xip[0];
 		builder->running.xmax = builder->running.xip[running->xcnt - 1];
 
+
 		/* makes comparisons cheaper later */
 		TransactionIdRetreat(builder->running.xmin);
 		TransactionIdAdvance(builder->running.xmax);
 
 		builder->state = SNAPBUILD_FULL_SNAPSHOT;
 
-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
 		/*
 		 * Iterate through all xids, wait for them to finish.
 		 *
@@ -1359,9 +1362,49 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			if (TransactionIdIsCurrentTransactionId(xid))
 				elog(ERROR, "waiting for ourselves");
 
+			/*
+			 * This isn't required for the correctness of decoding, but to allow
+			 * isolationtester to notice that we're currently waiting for
+			 * something.
+			 */
 			XactLockTableWait(xid, NULL, NULL, XLTW_None);
 		}
 
+		/*
+		 * Because of the race condition in LogStandbySnapshot() the
+		 * transactions recorded in xl_running_xacts as running might have
+		 * already committed by the time the xl_running_xacts was written
+		 * to WAL. Use the information about decoded transactions that we
+		 * gathered so far to update our idea about what's still running.
+		 *
+		 * We can use SnapBuildEndTxn directly as it only does the transaction
+		 * running check and handling without any additional side effects.
+		 */
+		for (off = 0; off < builder->committed.xcnt; off++)
+			SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
+
+		/* Report which action we actually did here. */
+		if (!builder->running.xcnt)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+
 		/* nothing could have built up so far, so don't perform cleanup */
 		return false;
 	}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cd14667..4ea81f8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2055,12 +2055,13 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
 
+	/* We don't release XidGenLock here, the caller is responsible for that */
+	LWLockRelease(ProcArrayLock);
+
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
 	Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));
 
-	/* We don't release the locks here, the caller is responsible for that */
-
 	return CurrentRunningXacts;
 }
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..f461f21 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -933,27 +933,8 @@ LogStandbySnapshot(void)
 	 */
 	running = GetRunningTransactionData();
 
-	/*
-	 * GetRunningTransactionData() acquired ProcArrayLock, we must release it.
-	 * For Hot Standby this can be done before inserting the WAL record
-	 * because ProcArrayApplyRecoveryInfo() rechecks the commit status using
-	 * the clog. For logical decoding, though, the lock can't be released
-	 * early because the clog might be "in the future" from the POV of the
-	 * historic snapshot. This would allow for situations where we're waiting
-	 * for the end of a transaction listed in the xl_running_xacts record
-	 * which, according to the WAL, has committed before the xl_running_xacts
-	 * record. Fortunately this routine isn't executed frequently, and it's
-	 * only a shared lock.
-	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.7.4

snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patchtext/x-patch; name=snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patchDownload
From f88add9c98f1563a602fbd4dd34d18149d6c7058 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 4/4] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
 src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 40937fb..9f536b0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -955,6 +955,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -976,10 +977,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -992,21 +1002,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1018,6 +1017,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1026,14 +1035,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1046,10 +1049,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1070,14 +1081,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * already built, or if the transaction has made catalog changes or
+		 * when we can't know if transaction made catalog changes.
 		 */
-		if (builder->snapshot)
+		build_snapshot = !builder->snapshot || top_needs_timetravel ||
+			sub_needs_timetravel || !skip_forced_snapshot;
+
+		/*
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
+		 */
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1087,11 +1113,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#14Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#13)
5 attachment(s)
Re: snapbuild woes

On 24/02/17 22:56, Petr Jelinek wrote:

On 22/02/17 03:05, Petr Jelinek wrote:

On 13/12/16 00:38, Petr Jelinek wrote:

On 12/12/16 23:33, Andres Freund wrote:

On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I've only seen it on
a production system though, and didn't really manage to easily reproduce
it locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does. So they're still running in the PGPROC
sense, just not the crash-recovery sense...

That looks like a reasonable explanation. BTW I realized my patch needs a
bit more work; currently it will break the actual snapshot, as it behaves
the same as if the xl_running_xacts was empty, which is not correct AFAICS.

Hi,

I got to work on this again. Unfortunately I haven't found a solution
that I would be very happy with. What I did is: in case we read an
xl_running_xacts which has all the transactions we track already
finished, we start tracking from that new xl_running_xacts again, with
the difference that we clean up the running transactions based on
previously seen committed ones. That means that on a busy server we may
wait for multiple xl_running_xacts rather than just one, but at least we
have a chance to finish, unlike with the current coding which basically
waits for an empty xl_running_xacts. I also removed the additional
locking for logical wal_level in LogStandbySnapshot() since it does not
work.
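
(To make that concrete, a rough sketch of what SnapBuildFindSnapshot()
ends up doing with the attached "Fix xl_running_xacts usage in snapshot
builder" patch; heavily condensed, the real code is in the patch itself:)

    /*
     * (Re)start tracking either on the first usable xl_running_xacts, or
     * whenever a later one no longer overlaps the set we have been
     * waiting for (its oldestRunningXid is past our tracked xmax).
     */
    else if (!builder->running.xcnt ||
             running->oldestRunningXid > builder->running.xmax)
    {
        /* ... remember the xids the record claims are running ... */

        /*
         * Some of those xids may already have committed before the record
         * was written (the race in LogStandbySnapshot()), so purge them
         * using the commits decoded while waiting so far.
         */
        for (off = 0; off < builder->committed.xcnt; off++)
            SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);

        if (builder->state == SNAPBUILD_CONSISTENT)
            return false;

        /* otherwise keep waiting for the remaining transactions to end */
    }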

Not hearing any opposition to this idea, I decided to polish it and also
optimize it a bit.

That being said, thanks to testing from Erik Rijkers I've identified one
more bug in how we do the initial snapshot. Apparently we don't reserve
the global xmin when we start building the initial exported snapshot for
a slot (we only reserve catalog_xmin, which is fine for logical decoding
but not for the exported snapshot), so VACUUM and heap pruning will
happily delete old versions of rows that are still needed by anybody
trying to use that exported snapshot.
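
(A minimal sketch of the fix, condensed from the attached 0001 patch:
reserve a data xmin alongside catalog_xmin while the exported snapshot
may still be used, and drop it again when the decoding context goes away.)

    /* In CreateInitDecodingContext(), under ProcArrayLock: */
    slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
    slot->data.catalog_xmin = slot->effective_catalog_xmin;
    slot->effective_xmin = slot->effective_catalog_xmin;    /* new */
    ReplicationSlotsComputeRequiredXmin(true);

    /*
     * In FreeDecodingContext(), once the exported snapshot can no longer
     * be used:
     */
    SpinLockAcquire(&slot->mutex);
    slot->effective_xmin = InvalidTransactionId;
    SpinLockRelease(&slot->mutex);
    ReplicationSlotsComputeRequiredXmin(false);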

Aaand I found one more bug in snapbuild. Apparently we don't protect the
snapshot builder xmin from going backwards, which can yet again result in
a corrupted exported snapshot.
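
(The fix for this one is tiny -- roughly what 0003 below does in
SnapBuildProcessRunningXacts():)

    /* Never let xl_running_xacts move the builder's xmin backwards. */
    if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin))
        builder->xmin = running->oldestRunningXid;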

Summary of attached patches:
0001 - Fixes the above mentioned global xmin tracking issues. Needs to
be backported all the way to 9.4

0002 - Removes use of the logical decoding saved snapshots for initial
exported snapshot since those snapshots only work for catalogs and not
user data. Also needs to be backported all the way to 9.4.

0003 - Makes sure snapshot builder xmin is not moved backwards by
xl_running_xacts (which can otherwise happen during initial snapshot
building). Also should be backported to 9.4.

0004 - Changes handling of the xl_running_xacts in initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

0005 - Improves performance of initial snapshot building by skipping
catalog snapshot build for transactions that don't do catalog changes.

I made some improvements to the other patches as well, so they are not
the same as in the previous post, hence the version bump.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patchtext/x-patch; name=snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patchDownload
From dfa0b07639059675704a35f2e63be4934f97c3a3 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
 src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..57c392c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
 	 * protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily in the in-memory
+	 * state so that the initial snapshot can be exported. After initial
+	 * snapshot is done global xmin should be reset and not tracked anymore
+	 * so we are fine with losing the global xmin after crash.
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	slot->effective_xmin = slot->effective_catalog_xmin;
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
@@ -282,7 +288,7 @@ CreateInitDecodingContext(char *plugin,
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
 	 */
-	xmin_horizon = slot->data.catalog_xmin;
+	xmin_horizon = slot->effective_xmin;
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
@@ -456,9 +462,22 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 void
 FreeDecodingContext(LogicalDecodingContext *ctx)
 {
+	ReplicationSlot *slot = MyReplicationSlot;
+
 	if (ctx->callbacks.shutdown_cb != NULL)
 		shutdown_cb_wrapper(ctx);
 
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext(). We do not take ProcArrayLock or similar
+	 * since we only reset xmin here and there's not much harm done by a
+	 * concurrent computation missing that.
+	 */
+	SpinLockAcquire(&slot->mutex);
+	slot->effective_xmin = InvalidTransactionId;
+	SpinLockRelease(&slot->mutex);
+	ReplicationSlotsComputeRequiredXmin(false);
+
 	ReorderBufferFree(ctx->reorder);
 	FreeSnapshotBuilder(ctx->snapshot_builder);
 	XLogReaderFree(ctx->reader);
-- 
2.7.4

snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchtext/x-patch; name=snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchDownload
From 75b24caff2dff98cefd05e4bce7a0600bdeec2d8 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 20:14:44 +0100
Subject: [PATCH 2/5] Don't use on disk snapshots for snapshot export in
 logical decoding

We store historical snapshots on disk to enable continuation of logical
decoding after restart. These snapshots were reused by the
slot initialization code when searching for consistent snapshot. However
these snapshots are only useful for catalogs and not for normal user
tables. So when we exported such snapshots for user to read data from
tables that is consistent with a specific LSN of slot creation, user
would instead read wrong data. There does not seem to be simple way to
make the logical decoding historical snapshots useful for normal tables
so don't use them for exporting at all for now.
---
 src/backend/replication/logical/snapbuild.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6f19cdc..b742c79 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1210,11 +1210,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 {
 	/* ---
 	 * Build catalog decoding snapshot incrementally using information about
-	 * the currently running transactions. There are several ways to do that:
+	 * the currently running transactions. There are couple ways to do that:
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 *	  state we were waiting for b).
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1227,9 +1227,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  Interestingly, in contrast to HS, this allows us not to care about
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
-	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.
 	 * ---
 	 */
 
@@ -1284,13 +1281,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
-	{
-		/* there won't be any state to cleanup */
-		return false;
-	}
-
 	/*
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
-- 
2.7.4

snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patchtext/x-patch; name=snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patchDownload
From e45af918ff67c936d34728c6acf04dc62074f691 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

---
 src/backend/replication/logical/snapbuild.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b742c79..1e8346d 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1141,7 +1141,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * looking, it's correct and actually more efficient this way since we hit
 	 * fast paths in tqual.c.
 	 */
-	builder->xmin = running->oldestRunningXid;
+	if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin))
+		builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
 	SnapBuildPurgeCommittedTxn(builder);
-- 
2.7.4

snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patchtext/x-patch; name=snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patchDownload
From 12b1228e8b6cf3040ba14a34253fcbb35fedff8f Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for commits. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.

This also reverts changes made to GetRunningTransactionData() and
LogStandbySnapshot() by b89e151 as the additional locking does not help.
---
 src/backend/replication/logical/snapbuild.c | 71 ++++++++++++++++++++++++-----
 src/backend/storage/ipc/procarray.c         |  5 +-
 src/backend/storage/ipc/standby.c           | 19 --------
 3 files changed, 63 insertions(+), 32 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1e8346d..f683c24 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().
 	 *	  NB: We need to search running.xip when seeing a transaction's end to
 	 *	  make sure it's a toplevel transaction and it's been one of the
 	 *	  initially running ones.
@@ -1286,11 +1291,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.
 	 */
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)
 	{
 		int			off;
+		bool		first = builder->running.xcnt == 0;
 
 		/*
 		 * We only care about toplevel xids as those are the ones we
@@ -1326,20 +1337,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->running.xmin = builder->running.xip[0];
 		builder->running.xmax = builder->running.xip[running->xcnt - 1];
 
+
 		/* makes comparisons cheaper later */
 		TransactionIdRetreat(builder->running.xmin);
 		TransactionIdAdvance(builder->running.xmax);
 
 		builder->state = SNAPBUILD_FULL_SNAPSHOT;
 
-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
 		/*
 		 * Iterate through all xids, wait for them to finish.
 		 *
@@ -1359,9 +1363,54 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			if (TransactionIdIsCurrentTransactionId(xid))
 				elog(ERROR, "waiting for ourselves");
 
+			/*
+			 * This isn't required for the correctness of decoding, but to allow
+			 * isolationtester to notice that we're currently waiting for
+			 * something.
+			 */
 			XactLockTableWait(xid, NULL, NULL, XLTW_None);
 		}
 
+		/*
+		 * If this is the first time we've seen xl_running_xacts, we are done.
+		 */
+		if (first)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			/*
+			 * Because of the race condition in LogStandbySnapshot() the
+			 * transactions recorded in xl_running_xacts as running might have
+			 * already committed by the time the xl_running_xacts was written
+			 * to WAL. Use the information about decoded transactions that we
+			 * gathered so far to update our idea about what's still running.
+			 *
+			 * We can use SnapBuildEndTxn directly as it only does the
+			 * transaction running check and handling without any additional
+			 * side effects.
+			 */
+			for (off = 0; off < builder->committed.xcnt; off++)
+				SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
+			if (builder->state == SNAPBUILD_CONSISTENT)
+				return false;
+
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+
 		/* nothing could have built up so far, so don't perform cleanup */
 		return false;
 	}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cd14667..4ea81f8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2055,12 +2055,13 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
 
+	/* We don't release XidGenLock here, the caller is responsible for that */
+	LWLockRelease(ProcArrayLock);
+
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
 	Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));
 
-	/* We don't release the locks here, the caller is responsible for that */
-
 	return CurrentRunningXacts;
 }
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..f461f21 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -933,27 +933,8 @@ LogStandbySnapshot(void)
 	 */
 	running = GetRunningTransactionData();
 
-	/*
-	 * GetRunningTransactionData() acquired ProcArrayLock, we must release it.
-	 * For Hot Standby this can be done before inserting the WAL record
-	 * because ProcArrayApplyRecoveryInfo() rechecks the commit status using
-	 * the clog. For logical decoding, though, the lock can't be released
-	 * early because the clog might be "in the future" from the POV of the
-	 * historic snapshot. This would allow for situations where we're waiting
-	 * for the end of a transaction listed in the xl_running_xacts record
-	 * which, according to the WAL, has committed before the xl_running_xacts
-	 * record. Fortunately this routine isn't executed frequently, and it's
-	 * only a shared lock.
-	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.7.4

snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patchtext/x-patch; name=snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patchDownload
From 97212069802bb2a727be7597a89ba73febca4fef Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 5/5] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
 src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index f683c24..5dbf87b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -991,21 +1001,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1017,6 +1016,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1025,14 +1034,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1045,10 +1048,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1069,14 +1080,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * already built, or if the transaction has made catalog changes or
+		 * when we can't know if transaction made catalog changes.
 		 */
-		if (builder->snapshot)
+		build_snapshot = !builder->snapshot || top_needs_timetravel ||
+			sub_needs_timetravel || !skip_forced_snapshot;
+
+		/*
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
+		 */
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1086,11 +1112,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#15Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#14)
5 attachment(s)
Re: snapbuild woes

On 26/02/17 01:43, Petr Jelinek wrote:

On 24/02/17 22:56, Petr Jelinek wrote:

On 22/02/17 03:05, Petr Jelinek wrote:

On 13/12/16 00:38, Petr Jelinek wrote:

On 12/12/16 23:33, Andres Freund wrote:

On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:

On 12/12/16 22:42, Andres Freund wrote:

Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:

Hi,
First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged.

I don't think that's actually true? Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you
observed must actually be a bit more complex :(

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I've only seen it on
a production system though, and didn't really manage to easily reproduce
it locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does. So they're still running in the PGPROC
sense, just not the crash-recovery sense...

That looks like a reasonable explanation. BTW I realized my patch needs a
bit more work; currently it will break the actual snapshot, as it behaves
the same as if the xl_running_xacts was empty, which is not correct AFAICS.

Hi,

I got to work on this again. Unfortunately I haven't found a solution
that I would be very happy with. What I did is: in case we read an
xl_running_xacts which has all the transactions we track already
finished, we start tracking from that new xl_running_xacts again, with
the difference that we clean up the running transactions based on
previously seen committed ones. That means that on a busy server we may
wait for multiple xl_running_xacts rather than just one, but at least we
have a chance to finish, unlike with the current coding which basically
waits for an empty xl_running_xacts. I also removed the additional
locking for logical wal_level in LogStandbySnapshot() since it does not
work.

Not hearing any opposition to this idea, I decided to polish it and also
optimize it a bit.

That being said, thanks to testing from Erik Rijkers I've identified one
more bug in how we do the initial snapshot. Apparently we don't reserve
the global xmin when we start building the initial exported snapshot for
a slot (we only reserve catalog_xmin, which is fine for logical decoding
but not for the exported snapshot), so VACUUM and heap pruning will
happily delete old versions of rows that are still needed by anybody
trying to use that exported snapshot.

Aaand I found one more bug in snapbuild. Apparently we don't protect the
snapshot builder xmin from going backwards, which can yet again result in
a corrupted exported snapshot.

Summary of attached patches:
0001 - Fixes the above mentioned global xmin tracking issues. Needs to
be backported all the way to 9.4

0002 - Removes use of the logical decoding saved snapshots for initial
exported snapshot since those snapshots only work for catalogs and not
user data. Also needs to be backported all the way to 9.4.

I've been a bit overzealous about this one (I removed the use of the
saved snapshots completely). So here is a fix for that (and a rebase on
top of current HEAD).
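
(Concretely, the v5 version of 0002 keeps the on-disk-snapshot case but
guards it, roughly like this -- see the attached patch for the real hunk:)

    /*
     * c) valid on-disk state, but never while building the snapshot that
     * will be exported at slot creation (initial_xmin_horizon is only set
     * in that case).
     */
    else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
             SnapBuildRestore(builder, lsn))
    {
        /* there won't be any state to cleanup */
        return false;
    }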

0003 - Makes sure snapshot builder xmin is not moved backwards by
xl_running_xacts (which can otherwise happen during initial snapshot
building). Also should be backported to 9.4.

0004 - Changes handling of the xl_running_xacts in initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

0005 - Improves performance of initial snapshot building by skipping
catalog snapshot build for transactions that don't do catalog changes.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patchtext/x-patch; name=snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patchDownload
From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
 src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..57c392c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
 	 * protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily in the in-memory
+	 * state so that the initial snapshot can be exported. After initial
+	 * snapshot is done global xmin should be reset and not tracked anymore
+	 * so we are fine with losing the global xmin after crash.
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	slot->effective_xmin = slot->effective_catalog_xmin;
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
@@ -282,7 +288,7 @@ CreateInitDecodingContext(char *plugin,
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
 	 */
-	xmin_horizon = slot->data.catalog_xmin;
+	xmin_horizon = slot->effective_xmin;
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
@@ -456,9 +462,22 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 void
 FreeDecodingContext(LogicalDecodingContext *ctx)
 {
+	ReplicationSlot *slot = MyReplicationSlot;
+
 	if (ctx->callbacks.shutdown_cb != NULL)
 		shutdown_cb_wrapper(ctx);
 
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext(). We do not take ProcArrayLock or similar
+	 * since we only reset xmin here and there's not much harm done by a
+	 * concurrent computation missing that.
+	 */
+	SpinLockAcquire(&slot->mutex);
+	slot->effective_xmin = InvalidTransactionId;
+	SpinLockRelease(&slot->mutex);
+	ReplicationSlotsComputeRequiredXmin(false);
+
 	ReorderBufferFree(ctx->reorder);
 	FreeSnapshotBuilder(ctx->snapshot_builder);
 	XLogReaderFree(ctx->reader);
-- 
2.7.4

snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchtext/x-patch; name=snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchDownload
From b171eb533f4dd14f9f5082b469e5218c1bf13682 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 20:14:44 +0100
Subject: [PATCH 2/5] Don't use on disk snapshots for snapshot export in
 logical decoding

We store historical snapshots on disk to enable continuation of logical
decoding after restart. These snapshots were also used by the slot
initialization code for the initial snapshot that the slot exports to aid
synchronization of data copy and the stream consumption. However
these snapshots are only useful for catalogs and not for normal user
tables. So when we exported such snapshots for user to read data from
tables that is consistent with a specific LSN of slot creation, user
would instead read wrong data.

This patch changes the code so that stored snapshots are only used for
logical decoding restart but not for initial slot snapshot.
---
 src/backend/replication/logical/snapbuild.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 6f19cdc..4b0c1e0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1214,7 +1214,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 *	  state we were waiting for b) or c).
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1229,7 +1229,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  at all.
 	 *
 	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.
+	 *	  snapshot to disk that we can use. We can't use this method for the
+	 *	  initial snapshot when slot is being created as that snapshot may be
+	 *	  exported and used for reading user data.
 	 * ---
 	 */
 
@@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
+	/* c) valid on disk state and not exported snapshot */
+	else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
+			 SnapBuildRestore(builder, lsn))
 	{
 		/* there won't be any state to cleanup */
 		return false;
 	}
-
 	/*
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
-- 
2.7.4

snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patchtext/x-patch; name=snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patchDownload
From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

---
 src/backend/replication/logical/snapbuild.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4b0c1e0..49c4337 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1141,7 +1141,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * looking, it's correct and actually more efficient this way since we hit
 	 * fast paths in tqual.c.
 	 */
-	builder->xmin = running->oldestRunningXid;
+	if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin))
+		builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
 	SnapBuildPurgeCommittedTxn(builder);
-- 
2.7.4

snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patchtext/x-patch; name=snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patchDownload
From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for commits. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.

This also reverts changes made to GetRunningTransactionData() and
LogStandbySnapshot() by b89e151 as the additional locking does not help.
---
 src/backend/replication/logical/snapbuild.c | 71 ++++++++++++++++++++++++-----
 src/backend/storage/ipc/procarray.c         |  5 +-
 src/backend/storage/ipc/standby.c           | 19 --------
 3 files changed, 63 insertions(+), 32 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 49c4337..1a1c9ba 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().
 	 *	  NB: We need to search running.xip when seeing a transaction's end to
 	 *	  make sure it's a toplevel transaction and it's been one of the
 	 *	  initially running ones.
@@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.
 	 */
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)
 	{
 		int			off;
+		bool		first = builder->running.xcnt == 0;
 
 		/*
 		 * We only care about toplevel xids as those are the ones we
@@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->running.xmin = builder->running.xip[0];
 		builder->running.xmax = builder->running.xip[running->xcnt - 1];
 
+
 		/* makes comparisons cheaper later */
 		TransactionIdRetreat(builder->running.xmin);
 		TransactionIdAdvance(builder->running.xmax);
 
 		builder->state = SNAPBUILD_FULL_SNAPSHOT;
 
-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
 		/*
 		 * Iterate through all xids, wait for them to finish.
 		 *
@@ -1371,9 +1375,54 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			if (TransactionIdIsCurrentTransactionId(xid))
 				elog(ERROR, "waiting for ourselves");
 
+			/*
+			 * This isn't required for the correctness of decoding, but to allow
+			 * isolationtester to notice that we're currently waiting for
+			 * something.
+			 */
 			XactLockTableWait(xid, NULL, NULL, XLTW_None);
 		}
 
+		/*
+		 * If this is the first time we've seen xl_running_xacts, we are done.
+		 */
+		if (first)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			/*
+			 * Because of the race condition in LogStandbySnapshot() the
+			 * transactions recorded in xl_running_xacts as running might have
+			 * already committed by the time the xl_running_xacts was written
+			 * to WAL. Use the information about decoded transactions that we
+			 * gathered so far to update our idea about what's still running.
+			 *
+			 * We can use SnapBuildEndTxn directly as it only does the
+			 * transaction running check and handling without any additional
+			 * side effects.
+			 */
+			for (off = 0; off < builder->committed.xcnt; off++)
+				SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
+			if (builder->state == SNAPBUILD_CONSISTENT)
+				return false;
+
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+
 		/* nothing could have built up so far, so don't perform cleanup */
 		return false;
 	}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cd14667..4ea81f8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2055,12 +2055,13 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
 
+	/* We don't release XidGenLock here, the caller is responsible for that */
+	LWLockRelease(ProcArrayLock);
+
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
 	Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));
 
-	/* We don't release the locks here, the caller is responsible for that */
-
 	return CurrentRunningXacts;
 }
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..f461f21 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -933,27 +933,8 @@ LogStandbySnapshot(void)
 	 */
 	running = GetRunningTransactionData();
 
-	/*
-	 * GetRunningTransactionData() acquired ProcArrayLock, we must release it.
-	 * For Hot Standby this can be done before inserting the WAL record
-	 * because ProcArrayApplyRecoveryInfo() rechecks the commit status using
-	 * the clog. For logical decoding, though, the lock can't be released
-	 * early because the clog might be "in the future" from the POV of the
-	 * historic snapshot. This would allow for situations where we're waiting
-	 * for the end of a transaction listed in the xl_running_xacts record
-	 * which, according to the WAL, has committed before the xl_running_xacts
-	 * record. Fortunately this routine isn't executed frequently, and it's
-	 * only a shared lock.
-	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.7.4

snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch (text/x-patch)
From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 5/5] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
 src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1a1c9ba..c800aa5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -991,21 +1001,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1017,6 +1016,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1025,14 +1034,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1045,10 +1048,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1069,14 +1080,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * already built, or if the transaction has made catalog changes or
+		 * when we can't know if transaction made catalog changes.
 		 */
-		if (builder->snapshot)
+		build_snapshot = !builder->snapshot || top_needs_timetravel ||
+			sub_needs_timetravel || !skip_forced_snapshot;
+
+		/*
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
+		 */
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1086,11 +1112,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#16Erik Rijkers
er@xs4all.nl
In reply to: Petr Jelinek (#15)
Re: snapbuild woes

On 2017-03-03 01:30, Petr Jelinek wrote:

With these patches:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

I get:

subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows
non-static declaration
static int oid_cmp(const void *p1, const void *p2);
^~~~~~~
In file included from subscriptioncmds.c:42:0:
../../../src/include/utils/builtins.h:70:12: note: previous declaration
of ‘oid_cmp’ was here
extern int oid_cmp(const void *p1, const void *p2);
^~~~~~~
make[3]: *** [subscriptioncmds.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [commands-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2


#17Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#16)
Re: snapbuild woes

On 03/03/17 01:53, Erik Rijkers wrote:

On 2017-03-03 01:30, Petr Jelinek wrote:

With these patches:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch

snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

I get:

subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows
non-static declaration
static int oid_cmp(const void *p1, const void *p2);
^~~~~~~
In file included from subscriptioncmds.c:42:0:
../../../src/include/utils/builtins.h:70:12: note: previous declaration
of ‘oid_cmp’ was here
extern int oid_cmp(const void *p1, const void *p2);
^~~~~~~
make[3]: *** [subscriptioncmds.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [commands-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2

Yes the copy patch needs rebase as well. But these ones are fine.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18David Steele
david@pgmasters.net
In reply to: Petr Jelinek (#17)
Re: snapbuild woes

On 3/2/17 7:54 PM, Petr Jelinek wrote:

Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

--
-David
david@pgmasters.net


#19Andres Freund
andres@anarazel.de
In reply to: David Steele (#18)
Re: snapbuild woes

On 2017-04-08 09:51:39 -0400, David Steele wrote:

On 3/2/17 7:54 PM, Petr Jelinek wrote:

Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.

- Andres


#20Erik Rijkers
er@xs4all.nl
In reply to: Andres Freund (#19)
Re: snapbuild woes

On 2017-04-08 15:56, Andres Freund wrote:

On 2017-04-08 09:51:39 -0400, David Steele wrote:

On 3/2/17 7:54 PM, Petr Jelinek wrote:

Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.

CF 2017-07 pertains to postgres 11, is that right?

But I hope you mean to commit these snapbuild patches before the
postgres 10 release? As far as I know, logical replication is still
very broken without them (or at least some of that set of 5 patches - I
don't know which ones are essential and which may not be).

If it's at all useful I can repeat tests to show how often current
master still fails (easily 50% or so failure-rate).

This would be the pgbench-over-logical-replication test that I did so
often earlier on.

thanks,

Erik Rijkers


#21Andres Freund
andres@anarazel.de
In reply to: Erik Rijkers (#20)
Re: snapbuild woes

On 2017-04-08 16:29:10 +0200, Erik Rijkers wrote:

On 2017-04-08 15:56, Andres Freund wrote:

On 2017-04-08 09:51:39 -0400, David Steele wrote:

On 3/2/17 7:54 PM, Petr Jelinek wrote:

Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.

CF 2017-07 pertains to postgres 11, is that right?

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

- Andres


#22David Steele
david@pgmasters.net
In reply to: Erik Rijkers (#20)
Re: snapbuild woes

On 4/8/17 10:29 AM, Erik Rijkers wrote:

On 2017-04-08 15:56, Andres Freund wrote:

On 2017-04-08 09:51:39 -0400, David Steele wrote:

On 3/2/17 7:54 PM, Petr Jelinek wrote:

Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.

CF 2017-07 pertains to postgres 11, is that right?

In general, yes, but bugs will always be fixed as needed. It doesn't
matter what CF they are in.

--
-David
david@pgmasters.net


#23Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#21)
Re: snapbuild woes

On Sat, Apr 08, 2017 at 07:30:59AM -0700, Andres Freund wrote:

On 2017-04-08 16:29:10 +0200, Erik Rijkers wrote:

On 2017-04-08 15:56, Andres Freund wrote:

On 2017-04-08 09:51:39 -0400, David Steele wrote:

On 3/2/17 7:54 PM, Petr Jelinek wrote:

Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.

CF 2017-07 pertains to postgres 11, is that right?

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#24Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Noah Misch (#23)
Re: snapbuild woes

On 4/12/17 02:31, Noah Misch wrote:

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

I'm hoping for a word from Andres on this.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#25Simon Riggs
simon@2ndquadrant.com
In reply to: Petr Jelinek (#15)
Re: snapbuild woes

On 3 March 2017 at 00:30, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

0004 - Changes handling of the xl_running_xacts in initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

This seems OK and unlikely to have wider impact.

The "race condition" we're speaking about is by design, not a bug.

I think the initial comment could be slightly better worded; if I
didn't already know what is being discussed, I wouldn't be much further
forward after reading those comments.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#26Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#24)
Re: snapbuild woes

On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:

On 4/12/17 02:31, Noah Misch wrote:

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

I'm hoping for a word from Andres on this.

Feel free to reassign to me.


#27Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#15)
Re: snapbuild woes

Hi,
On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:

From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..57c392c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
* the slot machinery about the new limit. Once that's done the
* ProcArrayLock can be released as the slot machinery now is
* protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily in the in-memory
+	 * state so that the initial snapshot can be exported. After initial
+	 * snapshot is done global xmin should be reset and not tracked anymore
+	 * so we are fine with losing the global xmin after crash.
* ----
*/
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
slot->data.catalog_xmin = slot->effective_catalog_xmin;
+ slot->effective_xmin = slot->effective_catalog_xmin;

void
FreeDecodingContext(LogicalDecodingContext *ctx)
{
+	ReplicationSlot *slot = MyReplicationSlot;
+
if (ctx->callbacks.shutdown_cb != NULL)
shutdown_cb_wrapper(ctx);
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext().

Hm. Is that actually a meaningful point to do so? For one, it gets
called by pg_logical_slot_get_changes_guts(), but more importantly, the
snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
next command? If we rely on the snapshot magic done by ExportSnapshot()
it'd be worthwhile to mention that...

We do not take ProcArrayLock or similar
+	 * since we only reset xmin here and there's not much harm done by a
+	 * concurrent computation missing that.
+	 */

Hum. I was prepared to complain about this, but ISTM, that there's
absolutely no risk because the following
ReplicationSlotsComputeRequiredXmin(false); actually does all the
necessary locking? But still, I don't see much point in the
optimization.

This patch changes the code so that stored snapshots are only used for
logical decoding restart but not for initial slot snapshot.

Yea, that's a very good point...

@@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn

return false;
}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
+	/* c) valid on disk state and not exported snapshot */
+	else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
+			 SnapBuildRestore(builder, lsn))

Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
interface, where it'd strictly speaking not be required, right?

From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

A bit more commentary would be good. What does that protect us against?

From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for committs. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.

Needs more explanation about approach.

@@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
*	  simply track the number of in-progress toplevel transactions and
*	  lower it whenever one commits or aborts. When that number
*	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().

I'm not following this yet.

@@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
* b) first encounter of a useable xl_running_xacts record. If we had
* found one earlier we would either track running transactions (i.e.
* builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.

This needs to revise the preceding comment more heavily. "This is the
first!!! Or maybe not!" isn't easy to understand.

*/
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)

Isn't that wrong under wraparound?

{
int off;
+ bool first = builder->running.xcnt == 0;

/*
* We only care about toplevel xids as those are the ones we
@@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
builder->running.xmin = builder->running.xip[0];
builder->running.xmax = builder->running.xip[running->xcnt - 1];

+
/* makes comparisons cheaper later */
TransactionIdRetreat(builder->running.xmin);
TransactionIdAdvance(builder->running.xmax);

builder->state = SNAPBUILD_FULL_SNAPSHOT;

-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
+		/*
+		 * If this is the first time we've seen xl_running_xacts, we are done.
+		 */
+		if (first)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			/*
+			 * Because of the race condition in LogStandbySnapshot() the
+			 * transactions recorded in xl_running_xacts as running might have
+			 * already committed by the time the xl_running_xacts was written
+			 * to WAL. Use the information about decoded transactions that we
+			 * gathered so far to update our idea about what's still running.
+			 *
+			 * We can use SnapBuildEndTxn directly as it only does the
+			 * transaction running check and handling without any additional
+			 * side effects.
+			 */
+			for (off = 0; off < builder->committed.xcnt; off++)
+				SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
+			if (builder->state == SNAPBUILD_CONSISTENT)
+				return false;
+
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}

Hm, this is not pretty.

From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 5/5] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1a1c9ba..c800aa5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
bool		forced_timetravel = false;
bool		sub_needs_timetravel = false;
bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;

TransactionId xmax = xid;

@@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
/*
* We could avoid treating !SnapBuildTxnIsRunning transactions as
* timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
*/
forced_timetravel = true;
elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
}

That's pretty crudely bolted on the existing logic, isn't there a
simpler way?

Greetings,

Andres Freund


#28Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#26)
Re: snapbuild woes

On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:

On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:

On 4/12/17 02:31, Noah Misch wrote:

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

I'm hoping for a word from Andres on this.

Feel free to reassign to me.

Thanks for volunteering; I'll do that shortly. Please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#29Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#28)
Re: snapbuild woes

On April 12, 2017 9:58:12 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:

On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:

On 4/12/17 02:31, Noah Misch wrote:

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

I'm hoping for a word from Andres on this.

Feel free to reassign to me.

Thanks for volunteering; I'll do that shortly. Please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Well, volunteering might be the wrong word. These are all my fault, although none look v10 specific.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#30Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#27)
Re: snapbuild woes

Thanks for looking at this!

On 13/04/17 02:29, Andres Freund wrote:

Hi,
On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:

From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..57c392c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
* the slot machinery about the new limit. Once that's done the
* ProcArrayLock can be released as the slot machinery now is
* protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily in the in-memory
+	 * state so that the initial snapshot can be exported. After initial
+	 * snapshot is done global xmin should be reset and not tracked anymore
+	 * so we are fine with losing the global xmin after crash.
* ----
*/
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
slot->data.catalog_xmin = slot->effective_catalog_xmin;
+ slot->effective_xmin = slot->effective_catalog_xmin;

void
FreeDecodingContext(LogicalDecodingContext *ctx)
{
+	ReplicationSlot *slot = MyReplicationSlot;
+
if (ctx->callbacks.shutdown_cb != NULL)
shutdown_cb_wrapper(ctx);
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext().

Hm. Is that actually a meaningful point to do so? For one, it gets
called by pg_logical_slot_get_changes_guts(), but more importantly, the
snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
next command? If we rely on the snapshot magic done by ExportSnapshot()
it'd be worthwhile to mention that...

(I hadn't looked at the patch for a couple of months, so I don't
remember all the detailed decisions anymore.)

Yes, we rely on the backend's xmin being set for the exported snapshot.
We only care about the global xmin for exported snapshots really; I
assumed that's clear enough from "so that the initial snapshot can be
exported", but I guess there should be a clearer comment about this
where we actually clean it up.

We do not take ProcArrayLock or similar
+	 * since we only reset xmin here and there's not much harm done by a
+	 * concurrent computation missing that.
+	 */

Hum. I was prepared to complain about this, but ISTM, that there's
absolutely no risk because the following
ReplicationSlotsComputeRequiredXmin(false); actually does all the
necessary locking? But still, I don't see much point in the
optimization.

Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see
why we need it here. Please enlighten me.

This patch changes the code so that stored snapshots are only used for
logical decoding restart but not for initial slot snapshot.

Yea, that's a very good point...

@@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn

return false;
}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
+	/* c) valid on disk state and not exported snapshot */
+	else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
+			 SnapBuildRestore(builder, lsn))

Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
interface, where it'd strictly speaking not be required, right?

Good point. Maybe we should really tell snapshot builder if the snapshot
is going to be exported or not explicitly (see the rant all the way down).

From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

A bit more commentary would be good. What does that protect us against?

I think I explained that in the email. We might export snapshot with
xmin smaller than global xmin otherwise.

From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for committs. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.

Needs more explanation about approach.

@@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
*	  simply track the number of in-progress toplevel transactions and
*	  lower it whenever one commits or aborts. When that number
*	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().

I'm not following this yet.

Let me try to explain:
We get an xl_running_xacts with txes 1, 3, 4, but 1 already committed
before the record was logged, so decoding will never see it finish and
we never reach a consistent snapshot. At some point we might get an
xl_running_xacts with txes 6, 7, 8, which tells us that all transactions
from the initial xl_running_xacts must be closed. We then restart the
tracking from the beginning, as if this were the first xl_running_xacts
we've seen, with the exception that we also look into the past: if we've
already seen 6, 7 or 8 finish, we mark them as finished immediately
(hence avoiding the issue of transaction 6 having already committed
before the xl_running_xacts was written).
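
To make the example concrete, the flow amounts to roughly the following
(a condensed sketch pieced together from the 0004 hunks quoted above,
not the literal diff):

	/*
	 * In SnapBuildFindSnapshot(), on a later xl_running_xacts while still
	 * inconsistent: restart tracking from the new record, then purge the
	 * xids whose commits we have already decoded -- they only appear as
	 * "running" because of the LogStandbySnapshot() race.
	 */
	for (off = 0; off < builder->committed.xcnt; off++)
		SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);

	if (builder->state == SNAPBUILD_CONSISTENT)
		return false;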

@@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
* b) first encounter of a useable xl_running_xacts record. If we had
* found one earlier we would either track running transactions (i.e.
* builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.

This needs to revise the preceding comment more heavily. "This is the
first!!! Or maybe not!" isn't easy to understand.

Yeah, I found it a bit hard to make it sound correct and not confusing;
I even wondered if I should split this code in two because of that, but
that would lead to quite a bit of code duplication, and I'm not sure
that's better. Maybe we could move the "reset" code into a separate
function to avoid most of the duplication.

*/
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)

Isn't that wrong under wraparound?

Right, should use TransactionIdFollows.
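
I.e. the check would become something along these lines (a sketch of the
fix; the v5 patch above still has the plain comparison):

	else if (!builder->running.xcnt ||
			 TransactionIdFollows(running->oldestRunningXid,
								  builder->running.xmax))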

{
int off;
+ bool first = builder->running.xcnt == 0;

/*
* We only care about toplevel xids as those are the ones we
@@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
builder->running.xmin = builder->running.xip[0];
builder->running.xmax = builder->running.xip[running->xcnt - 1];

+
/* makes comparisons cheaper later */
TransactionIdRetreat(builder->running.xmin);
TransactionIdAdvance(builder->running.xmax);

builder->state = SNAPBUILD_FULL_SNAPSHOT;

-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
+		/*
+		 * If this is the first time we've seen xl_running_xacts, we are done.
+		 */
+		if (first)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			/*
+			 * Because of the race condition in LogStandbySnapshot() the
+			 * transactions recorded in xl_running_xacts as running might have
+			 * already committed by the time the xl_running_xacts was written
+			 * to WAL. Use the information about decoded transactions that we
+			 * gathered so far to update our idea about what's still running.
+			 *
+			 * We can use SnapBuildEndTxn directly as it only does the
+			 * transaction running check and handling without any additional
+			 * side effects.
+			 */
+			for (off = 0; off < builder->committed.xcnt; off++)
+				SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
+			if (builder->state == SNAPBUILD_CONSISTENT)
+				return false;
+
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}

Hm, this is not pretty.

Hmm? There are two possible scenarios that need to be handled
differently. Possibly another reason to split it out completely as
mentioned above.

From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 5/5] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1a1c9ba..c800aa5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
bool		forced_timetravel = false;
bool		sub_needs_timetravel = false;
bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;

TransactionId xmax = xid;

@@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
/*
* We could avoid treating !SnapBuildTxnIsRunning transactions as
* timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
*/
forced_timetravel = true;
elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
}

That's pretty crudely bolted on the existing logic, isn't there a
simpler way?

Agreed; however, every time I tried to make this prettier I ended up
producing subtle bugs (see the initial email in this thread for an
example of that), so I eventually gave up on pretty.

As a side note, my opinion after all this is that it's probably a
mistake to try to use various situational conditions to make sure we are
building an exportable snapshot. ISTM we should tell the snapbuilder
explicitly that the snapshot will be exported and let it behave
accordingly. For example, we should also track aborted transactions in a
snapshot that is to be exported, because otherwise enough of them
happening during snapshot building will make the export fail due to a
too-big snapshot. But this seems like too invasive a change to be
back-portable.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#31Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#29)
Re: snapbuild woes

On 13/04/17 07:02, Andres Freund wrote:

On April 12, 2017 9:58:12 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:

On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:

On 4/12/17 02:31, Noah Misch wrote:

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

I'm hoping for a word from Andres on this.

Feel free to reassign to me.

Thanks for volunteering; I'll do that shortly. Please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Well, volunteering might be the wrong word. These are all my fault, although none look v10 specific.

Yeah, none of this is v10 specific; the importance for v10 is that it
affects the in-core logical replication, not just extensions as in
9.4-9.6.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#32Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#30)
5 attachment(s)
Re: snapbuild woes

Hi, here is the updated patch set (details inline).

On 13/04/17 20:00, Petr Jelinek wrote:

Thanks for looking at this!

On 13/04/17 02:29, Andres Freund wrote:

Hi,
On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:

From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..57c392c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
* the slot machinery about the new limit. Once that's done the
* ProcArrayLock can be released as the slot machinery now is
* protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily in the in-memory
+	 * state so that the initial snapshot can be exported. After initial
+	 * snapshot is done global xmin should be reset and not tracked anymore
+	 * so we are fine with losing the global xmin after crash.
* ----
*/
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
slot->data.catalog_xmin = slot->effective_catalog_xmin;
+ slot->effective_xmin = slot->effective_catalog_xmin;

void
FreeDecodingContext(LogicalDecodingContext *ctx)
{
+	ReplicationSlot *slot = MyReplicationSlot;
+
if (ctx->callbacks.shutdown_cb != NULL)
shutdown_cb_wrapper(ctx);
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext().

Hm. Is that actually a meaningful point to do so? For one, it gets
called by pg_logical_slot_get_changes_guts(), but more importantly, the
snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
next command? If we rely on the snapshot magic done by ExportSnapshot()
it'd be worthwhile to mention that...

(I hadn't looked at the patch for a couple of months, so I don't
remember all the detailed decisions anymore.)

Yes, we rely on the backend's xmin being set for the exported snapshot.
We only care about the global xmin for exported snapshots really; I
assumed that's clear enough from "so that the initial snapshot can be
exported", but I guess there should be a clearer comment about this
where we actually clean it up.

Okay, wrote new comment there, how is it now?

We do not take ProcArrayLock or similar
+	 * since we only reset xmin here and there's not much harm done by a
+	 * concurrent computation missing that.
+	 */

Hum. I was prepared to complain about this, but ISTM, that there's
absolutely no risk because the following
ReplicationSlotsComputeRequiredXmin(false); actually does all the
necessary locking? But still, I don't see much point in the
optimization.

Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see
why we need it here. Please enlighten me.

I kept this as it was. After rereading: ReplicationSlotsComputeRequiredXmin(false)
will take only a shared lock, while if we wanted to avoid the mutex and
do the xmin update under the lock we'd need an exclusive lock, so I
think the optimization is worth it...
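
Roughly, the cleanup amounts to this (a condensed sketch -- the spinlock
here is illustrative, the attached patch has the authoritative version):

	/* in FreeDecodingContext(): drop the temporarily reserved global xmin */
	SpinLockAcquire(&slot->mutex);
	slot->effective_xmin = InvalidTransactionId;
	SpinLockRelease(&slot->mutex);

	/* takes the locks it needs itself, hence no ProcArrayLock above */
	ReplicationSlotsComputeRequiredXmin(false);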

This patch changes the code so that stored snapshots are only used for
logical decoding restart but not for initial slot snapshot.

Yea, that's a very good point...

@@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn

return false;
}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
+	/* c) valid on disk state and not exported snapshot */
+	else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
+			 SnapBuildRestore(builder, lsn))

Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
interface, where it'd strictly speaking not be required, right?

Good point. Maybe we should really tell snapshot builder if the snapshot
is going to be exported or not explicitly (see the rant all the way down).

I added a new signaling mechanism (a new boolean option indicating
whether we are building a full snapshot, which is only set when the
snapshot is exported or used by the transaction).
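
Along the lines of the following (a sketch with assumed names; the
attached patches have the real field and parameter names):

	/* flag passed down from the create-slot / snapshot-export path */
	builder->building_full_snapshot = need_full_snapshot;

	...

	/*
	 * In SnapBuildFindSnapshot(): only reuse a serialized snapshot when
	 * we don't need a full (exportable) one.
	 */
	else if (!builder->building_full_snapshot &&
			 SnapBuildRestore(builder, lsn))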

From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

A bit more commentary would be good. What does that protect us against?

I think I explained that in the email. We might export snapshot with
xmin smaller than global xmin otherwise.

Updated commit message with explanation as well.

From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder

Due to a race condition, xl_running_xacts might list transactions that
are no longer running. Previous coding tried to get around this with
additional locking, but that did not work correctly for commits. Instead,
try combining decoded commits and multiple xl_running_xacts to get to a
consistent snapshot.

Needs more explanation about approach.

@@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
*	  simply track the number of in-progress toplevel transactions and
*	  lower it whenever one commits or aborts. When that number
*	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().

I'm not following this yet.

Let me try to explain:
We get an xl_running_xacts with txes 1,3,4. But tx 1 had already committed
before the record was written, so decoding will never see it finish and we
never get a consistent snapshot. Now at some point we might get an
xl_running_xacts with txes 6,7,8, so we know that all transactions from
the initial xl_running_xacts must be closed. We restart the tracking here
from the beginning, as if this was the first xl_running_xacts we've seen,
with the exception that we look into the past: if we have already seen any
of 6,7,8 finish, we mark them as finished immediately (hence avoiding the
same issue if, say, transaction 6 had already committed before this
xl_running_xacts was written).
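To put made-up WAL positions on that sequence of events:

  LSN 100: xid 1 writes its commit record
  LSN 110: LogStandbySnapshot() samples the procarray; xid 1 hasn't removed
           itself yet, so xl_running_xacts {1,3,4} gets logged
  decoding starts at LSN 110 and waits for 1,3,4; the commit of 1 is at
  LSN 100, before our starting point, so we'd wait forever
  LSN 500: xl_running_xacts {6,7,8} arrives; its oldestRunningXid is past
           the xmax of the previously tracked set, so everything we were
           tracking must be closed -> restart tracking with {6,7,8},
           immediately marking as finished any of them whose commit we
           already decoded between LSN 110 and LSN 500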

@@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
* b) first encounter of a useable xl_running_xacts record. If we had
* found one earlier we would either track running transactions (i.e.
* builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.

This needs to revise the preceding comment more heavily. "This is the
first!!! Or maybe not!" isn't easy to understand.

Yeah, I found it a bit hard to make it sound correct and not confusing; I
even wondered if I should split this code in two because of that, but it
would lead to quite a bit of code duplication, and I don't know if that's
better. Maybe we could move the "reset" code into a separate function to
avoid most of the duplication.

Rewrote and moved this comment to its own thing.

*/
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)

Isn't that wrong under wraparound?

Right, should use TransactionIdFollows.

Fixed.

{
int off;
+ bool first = builder->running.xcnt == 0;

/*
* We only care about toplevel xids as those are the ones we
@@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
builder->running.xmin = builder->running.xip[0];
builder->running.xmax = builder->running.xip[running->xcnt - 1];

+
/* makes comparisons cheaper later */
TransactionIdRetreat(builder->running.xmin);
TransactionIdAdvance(builder->running.xmax);

builder->state = SNAPBUILD_FULL_SNAPSHOT;

-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
+		/*
+		 * If this is the first time we've seen xl_running_xacts, we are done.
+		 */
+		if (first)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			/*
+			 * Because of the race condition in LogStandbySnapshot() the
+			 * transactions recorded in xl_running_xacts as running might have
+			 * already committed by the time the xl_running_xacts was written
+			 * to WAL. Use the information about decoded transactions that we
+			 * gathered so far to update our idea about what's still running.
+			 *
+			 * We can use SnapBuildEndTxn directly as it only does the
+			 * transaction running check and handling without any additional
+			 * side effects.
+			 */
+			for (off = 0; off < builder->committed.xcnt; off++)
+				SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
+			if (builder->state == SNAPBUILD_CONSISTENT)
+				return false;
+
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}

Hm, this is not pretty.

Changed this whole thing to be two different code paths, with a common
function doing the shared work.

From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 5/5] Skip unnecessary snapshot builds

When doing the initial snapshot build during logical decoding
initialization, don't build snapshots for transactions that we know
didn't do any catalog changes. Otherwise we might end up with thousands
of useless snapshots on a busy server, which can be quite expensive.
---
src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1a1c9ba..c800aa5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
bool		forced_timetravel = false;
bool		sub_needs_timetravel = false;
bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;

TransactionId xmax = xid;

@@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
/*
* We could avoid treating !SnapBuildTxnIsRunning transactions as
* timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
*/
forced_timetravel = true;
elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
}

That's pretty crudely bolted onto the existing logic; isn't there a
simpler way?

Agreed; however, every time I tried to make this prettier I ended up
producing subtle bugs (see the initial email in this thread for an example
of that), so I eventually gave up on pretty.

Okay, I gave it one more try with a fresh head, hopefully without new bugs.
What do you think?

As a side note, my opinion after all this is that it's probably a mistake
to try to use various situational conditions to make sure we are building
an exportable snapshot. ISTM we should tell the snapbuilder explicitly that
the snapshot will be exported and have it behave accordingly. For example,
we should also track aborted transactions in a snapshot that is to be
exported, because otherwise enough of them happening during snapshot
building will result in an export failure due to a too-large snapshot. But
this seems like too invasive a change to be back-portable.

I ended up doing this in 0002 and also using those changes in 0005; it does
not seem to be that bad.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patchtext/plain; charset=UTF-8; name=0001-Reserve-global-xmin-for-create-slot-snasphot-export.patchDownload
From 073dfa48f2361b8ee6a656bcbe57d11cad4cc2b3 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Fri, 24 Feb 2017 21:39:03 +0100
Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export

Otherwise the VACUUM or pruning might remove tuples still needed by the
exported snapshot.
---
 src/backend/replication/logical/logical.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..58e1c80 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
 	 * protecting against vacuum.
+	 *
+	 * Note that we only store the global xmin temporarily in the in-memory
+	 * state so that the initial snapshot can be exported. After initial
+	 * snapshot is done global xmin should be reset and not tracked anymore
+	 * so we are fine with losing the global xmin after crash.
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	slot->effective_xmin = slot->effective_catalog_xmin;
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
@@ -282,7 +288,7 @@ CreateInitDecodingContext(char *plugin,
 	 * tell the snapshot builder to only assemble snapshot once reaching the
 	 * running_xact's record with the respective xmin.
 	 */
-	xmin_horizon = slot->data.catalog_xmin;
+	xmin_horizon = slot->effective_xmin;
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
@@ -456,9 +462,28 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 void
 FreeDecodingContext(LogicalDecodingContext *ctx)
 {
+	ReplicationSlot *slot = MyReplicationSlot;
+
 	if (ctx->callbacks.shutdown_cb != NULL)
 		shutdown_cb_wrapper(ctx);
 
+	/*
+	 * Cleanup global xmin for the slot that we may have set in
+	 * CreateInitDecodingContext(). It's okay to do this here unconditionally
+	 * because we only care for the global xmin for exported snapshots and if
+	 * we exported one we used the required xmin for the current backend
+	 * process in SnapBuildInitialSnapshot().
+	 *
+	 * We do not take ProcArrayLock or similar since we only reset xmin here
+	 * and there's not much harm done by a concurrent computation missing
+	 * that and ReplicationSlotsComputeRequiredXmin will do locking as
+	 * necessary.
+	 */
+	SpinLockAcquire(&slot->mutex);
+	slot->effective_xmin = InvalidTransactionId;
+	SpinLockRelease(&slot->mutex);
+	ReplicationSlotsComputeRequiredXmin(false);
+
 	ReorderBufferFree(ctx->reorder);
 	FreeSnapshotBuilder(ctx->snapshot_builder);
 	XLogReaderFree(ctx->reader);
-- 
2.7.4

0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchtext/plain; charset=UTF-8; name=0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patchDownload
From 163ae827097b5c2956fafa6c6884b58e3a53ae3b Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 20:14:44 +0100
Subject: [PATCH 2/5] Don't use on disk snapshots for snapshot export in
 logical decoding

We store historical snapshots on disk to enable continuation of logical
decoding after restart. These snapshots were also used by the slot
initialization code for the initial snapshot that the slot exports to aid
synchronization of the data copy with the stream consumption. However,
these snapshots are only useful for catalogs and not for normal user
tables. So when we exported such a snapshot for the user to read data
from tables consistent with a specific LSN of slot creation, the user
would instead read wrong data.

This patch changes the code so that stored snapshots are not used when
slot creation needs a full snapshot.
---
 src/backend/replication/logical/logical.c   | 10 +++++++---
 src/backend/replication/logical/snapbuild.c | 19 +++++++++++++------
 src/backend/replication/slotfuncs.c         |  2 +-
 src/backend/replication/walsender.c         |  7 ++++++-
 src/include/replication/logical.h           |  1 +
 src/include/replication/snapbuild.h         |  3 ++-
 6 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 58e1c80..79c1dd7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -114,6 +114,7 @@ static LogicalDecodingContext *
 StartupDecodingContext(List *output_plugin_options,
 					   XLogRecPtr start_lsn,
 					   TransactionId xmin_horizon,
+					   bool need_full_snapshot,
 					   XLogPageReadCB read_page,
 					   LogicalOutputPluginWriterPrepareWrite prepare_write,
 					   LogicalOutputPluginWriterWrite do_write)
@@ -171,7 +172,8 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
-		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn);
+		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
+								need_full_snapshot);
 
 	ctx->reorder->private_data = ctx;
 
@@ -210,6 +212,7 @@ StartupDecodingContext(List *output_plugin_options,
 LogicalDecodingContext *
 CreateInitDecodingContext(char *plugin,
 						  List *output_plugin_options,
+						  bool need_full_snapshot,
 						  XLogPageReadCB read_page,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write)
@@ -294,7 +297,8 @@ CreateInitDecodingContext(char *plugin,
 	ReplicationSlotSave();
 
 	ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
-								 read_page, prepare_write, do_write);
+								 need_full_snapshot, read_page, prepare_write,
+								 do_write);
 
 	/* call output plugin initialization callback */
 	old_context = MemoryContextSwitchTo(ctx->context);
@@ -383,7 +387,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	}
 
 	ctx = StartupDecodingContext(output_plugin_options,
-								 start_lsn, InvalidTransactionId,
+								 start_lsn, InvalidTransactionId, false,
 								 read_page, prepare_write, do_write);
 
 	/* call output plugin initialization callback */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2279604..ada618d 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,6 +165,9 @@ struct SnapBuild
 	 */
 	TransactionId initial_xmin_horizon;
 
+	/* Indicates if we are building a full snapshot or just a catalog one. */
+	bool		building_full_snapshot;
+
 	/*
 	 * Snapshot that's valid to see the catalog state seen at this moment.
 	 */
@@ -281,7 +284,8 @@ static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
-						XLogRecPtr start_lsn)
+						XLogRecPtr start_lsn,
+						bool need_full_snapshot)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -308,6 +312,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
+	builder->building_full_snapshot = need_full_snapshot;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1233,7 +1238,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 *	  state we were waiting for b) or c).
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1248,7 +1253,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  at all.
 	 *
 	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.
+	 *	  snapshot to disk that we can use. We can't use this method for the
+	 *	  initial snapshot when slot is being created and needs full snapshot
+	 *	  for export or direct use.
 	 * ---
 	 */
 
@@ -1303,13 +1310,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
+	/* c) valid on disk state and not full snapshot */
+	else if (!builder->building_full_snapshot &&
+			 SnapBuildRestore(builder, lsn))
 	{
 		/* there won't be any state to cleanup */
 		return false;
 	}
-
 	/*
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 7104c94..9775735 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -132,7 +132,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * Create logical decoding context, to build the initial snapshot.
 	 */
 	ctx = CreateInitDecodingContext(
-									NameStr(*plugin), NIL,
+									NameStr(*plugin), NIL, false,
 									logical_read_local_xlog_page, NULL, NULL);
 
 	/* build initial snapshot, might take a while */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dbb10c7..2784d67 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -873,6 +873,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
 	{
 		LogicalDecodingContext *ctx;
+		bool	need_full_snapshot = false;
 
 		/*
 		 * Do options check early so that we can bail before calling the
@@ -884,6 +885,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 				ereport(ERROR,
 						(errmsg("CREATE_REPLICATION_SLOT ... EXPORT_SNAPSHOT "
 								"must not be called inside a transaction")));
+
+			need_full_snapshot = true;
 		}
 		else if (snapshot_action == CRS_USE_SNAPSHOT)
 		{
@@ -906,9 +909,11 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 				ereport(ERROR,
 						(errmsg("CREATE_REPLICATION_SLOT ... USE_SNAPSHOT "
 								"must not be called in a subtransaction")));
+
+			need_full_snapshot = true;
 		}
 
-		ctx = CreateInitDecodingContext(cmd->plugin, NIL,
+		ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
 										logical_read_xlog_page,
 										WalSndPrepareWrite, WalSndWriteData);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88e..80f04c3 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,7 @@ extern void CheckLogicalDecodingRequirements(void);
 
 extern LogicalDecodingContext *CreateInitDecodingContext(char *plugin,
 						  List *output_plugin_options,
+						  bool need_full_snapshot,
 						  XLogPageReadCB read_page,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a8ae631..494751d 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -54,7 +54,8 @@ struct xl_running_xacts;
 extern void CheckPointSnapBuild(void);
 
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
-						TransactionId xmin_horizon, XLogRecPtr start_lsn);
+						TransactionId xmin_horizon, XLogRecPtr start_lsn,
+						bool need_full_snapshot);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
-- 
2.7.4

0003-Prevent-snapshot-builder-xmin-from-going-backwards.patchtext/plain; charset=UTF-8; name=0003-Prevent-snapshot-builder-xmin-from-going-backwards.patchDownload
From ae60b52ae0ca96bc14169cd507f101fbb5dfdf52 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

Logical decoding snapshot builder may encounter xl_running_xacts with
older xmin than the xmin of the builder. This can happen because
LogStandbySnapshot() sometimes sees already committed transactions as
running (there is difference between "running" in terms for WAL and in
terms of ProcArray). When this happens we must make sure that the xmin
of snapshot builder won't go back otherwise the resulting snapshot would
show some transaction as running even though they have already
committed.
---
 src/backend/replication/logical/snapbuild.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ada618d..3e34f75 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1165,7 +1165,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * looking, it's correct and actually more efficient this way since we hit
 	 * fast paths in tqual.c.
 	 */
-	builder->xmin = running->oldestRunningXid;
+	if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin))
+		builder->xmin = running->oldestRunningXid;
 
 	/* Remove transactions we don't need to keep track off anymore */
 	SnapBuildPurgeCommittedTxn(builder);
-- 
2.7.4

0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patchtext/plain; charset=UTF-8; name=0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patchDownload
From 1f9d3fe6f1fb9a9b39ea6bd9e1776a769fac8ea9 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder

Due to a race condition, xl_running_xacts might list transactions that
are no longer running. Previous coding tried to get around this with
additional locking, but that did not work correctly for commits. Instead,
try combining decoded commits and multiple xl_running_xacts to get to a
consistent snapshot.

This also reverts changes made to GetRunningTransactionData() and
LogStandbySnapshot() by b89e151 as the additional locking does not help.
---
 src/backend/replication/logical/snapbuild.c | 195 ++++++++++++++++++----------
 src/backend/storage/ipc/procarray.c         |   5 +-
 src/backend/storage/ipc/standby.c           |  19 ---
 3 files changed, 130 insertions(+), 89 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3e34f75..d989576 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1220,6 +1220,82 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 										  builder->last_serialized_snapshot);
 }
 
+/*
+ * Start tracking transactions based on the info we get from xl_running_xacts.
+ */
+static void
+SnapBuildStartXactTracking(SnapBuild *builder, xl_running_xacts *running)
+{
+	int			off;
+
+	/*
+	 * We only care about toplevel xids as those are the ones we
+	 * definitely see in the wal stream. As snapbuild.c tracks committed
+	 * instead of running transactions we don't need to know anything
+	 * about uncommitted subtransactions.
+	 */
+
+	/*
+	 * Start with an xmin/xmax that's correct for future, when all the
+	 * currently running transactions have finished. We'll update both
+	 * while waiting for the pending transactions to finish.
+	 */
+	builder->xmin = running->nextXid;		/* < are finished */
+	builder->xmax = running->nextXid;		/* >= are running */
+
+	/* so we can safely use the faster comparisons */
+	Assert(TransactionIdIsNormal(builder->xmin));
+	Assert(TransactionIdIsNormal(builder->xmax));
+
+	builder->running.xcnt = running->xcnt;
+	builder->running.xcnt_space = running->xcnt;
+	builder->running.xip =
+		MemoryContextAlloc(builder->context,
+						   builder->running.xcnt * sizeof(TransactionId));
+	memcpy(builder->running.xip, running->xids,
+		   builder->running.xcnt * sizeof(TransactionId));
+
+	/* sort so we can do a binary search */
+	qsort(builder->running.xip, builder->running.xcnt,
+		  sizeof(TransactionId), xidComparator);
+
+	builder->running.xmin = builder->running.xip[0];
+	builder->running.xmax = builder->running.xip[running->xcnt - 1];
+
+
+	/* makes comparisons cheaper later */
+	TransactionIdRetreat(builder->running.xmin);
+	TransactionIdAdvance(builder->running.xmax);
+
+	builder->state = SNAPBUILD_FULL_SNAPSHOT;
+
+	/*
+	 * Iterate through all xids, wait for them to finish.
+	 *
+	 * This isn't required for the correctness of decoding, but to allow
+	 * isolationtester to notice that we're currently waiting for
+	 * something.
+	 */
+	for (off = 0; off < builder->running.xcnt; off++)
+	{
+		TransactionId xid = builder->running.xip[off];
+
+		/*
+		 * Upper layers should prevent that we ever need to wait on
+		 * ourselves. Check anyway, since failing to do so would either
+		 * result in an endless wait or an Assert() failure.
+		 */
+		if (TransactionIdIsCurrentTransactionId(xid))
+			elog(ERROR, "waiting for ourselves");
+
+		/*
+		 * This isn't required for the correctness of decoding, but to allow
+		 * isolationtester to notice that we're currently waiting for
+		 * something.
+		 */
+		XactLockTableWait(xid, NULL, NULL, XLTW_None);
+	}
+}
 
 /*
  * Build the start of a snapshot that's capable of decoding the catalog.
@@ -1241,7 +1317,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
 	 *	  state we were waiting for b) or c).
 	 *
-	 * b) Wait for all toplevel transactions that were running to end. We
+	 * b) This (in a previous run) or another decoding slot serialized a
+	 *	  snapshot to disk that we can use. We can't use this method for the
+	 *	  initial snapshot when slot is being created and needs full snapshot
+	 *	  for export or direct use.
+
+	 * c) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
@@ -1252,11 +1333,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  Interestingly, in contrast to HS, this allows us not to care about
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
-	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use. We can't use this method for the
-	 *	  initial snapshot when slot is being created and needs full snapshot
-	 *	  for export or direct use.
 	 * ---
 	 */
 
@@ -1311,7 +1387,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state and not full snapshot */
+	/* b) valid on disk state and not full snapshot */
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
@@ -1319,54 +1395,14 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		return false;
 	}
 	/*
-	 * b) first encounter of a useable xl_running_xacts record. If we had
+	 * c) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
 	 * get called).
 	 */
 	else if (!builder->running.xcnt)
 	{
-		int			off;
-
-		/*
-		 * We only care about toplevel xids as those are the ones we
-		 * definitely see in the wal stream. As snapbuild.c tracks committed
-		 * instead of running transactions we don't need to know anything
-		 * about uncommitted subtransactions.
-		 */
-
-		/*
-		 * Start with an xmin/xmax that's correct for future, when all the
-		 * currently running transactions have finished. We'll update both
-		 * while waiting for the pending transactions to finish.
-		 */
-		builder->xmin = running->nextXid;		/* < are finished */
-		builder->xmax = running->nextXid;		/* >= are running */
-
-		/* so we can safely use the faster comparisons */
-		Assert(TransactionIdIsNormal(builder->xmin));
-		Assert(TransactionIdIsNormal(builder->xmax));
-
-		builder->running.xcnt = running->xcnt;
-		builder->running.xcnt_space = running->xcnt;
-		builder->running.xip =
-			MemoryContextAlloc(builder->context,
-							   builder->running.xcnt * sizeof(TransactionId));
-		memcpy(builder->running.xip, running->xids,
-			   builder->running.xcnt * sizeof(TransactionId));
-
-		/* sort so we can do a binary search */
-		qsort(builder->running.xip, builder->running.xcnt,
-			  sizeof(TransactionId), xidComparator);
-
-		builder->running.xmin = builder->running.xip[0];
-		builder->running.xmax = builder->running.xip[running->xcnt - 1];
-
-		/* makes comparisons cheaper later */
-		TransactionIdRetreat(builder->running.xmin);
-		TransactionIdAdvance(builder->running.xmax);
-
-		builder->state = SNAPBUILD_FULL_SNAPSHOT;
+		SnapBuildStartXactTracking(builder, running);
 
 		ereport(LOG,
 			(errmsg("logical decoding found initial starting point at %X/%X",
@@ -1376,30 +1412,53 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 							  builder->running.xcnt,
 							  (uint32) builder->running.xcnt)));
 
+		/* nothing could have built up so far, so don't perform cleanup */
+		return false;
+	}
+	/*
+	 * c) we have already seen an xl_running_xacts and tried the above.
+	 * However, because of the race condition in LogStandbySnapshot(), a
+	 * transaction reported as running might in reality have written its
+	 * commit record before the xl_running_xacts, so decoding has missed
+	 * it. We now see an xl_running_xacts suggesting that all transactions
+	 * from the original one were closed, yet the consistent state wasn't
+	 * reached, which means the race condition has indeed happened.
+	 *
+	 * Start tracking again as if this was the first xl_running_xacts we've
+	 * seen, with the advantage that, because decoding was already running,
+	 * any transactions committed before the xl_running_xacts record will be
+	 * known to us, so we won't hit the same issue again.
+	 */
+	else if (TransactionIdFollows(running->oldestRunningXid,
+								  builder->running.xmax))
+	{
+		int off;
+
+		SnapBuildStartXactTracking(builder, running);
+
 		/*
-		 * Iterate through all xids, wait for them to finish.
+		 * Mark any transactions that are known to have committed before the
+		 * xl_running_xacts as finished to avoid the race condition in
+		 * LogStandbySnapshot().
 		 *
-		 * This isn't required for the correctness of decoding, but to allow
-		 * isolationtester to notice that we're currently waiting for
-		 * something.
+		 * We can use SnapBuildEndTxn directly as it only does the
+		 * transaction running check and handling without any additional
+		 * side effects.
 		 */
-		for (off = 0; off < builder->running.xcnt; off++)
-		{
-			TransactionId xid = builder->running.xip[off];
-
-			/*
-			 * Upper layers should prevent that we ever need to wait on
-			 * ourselves. Check anyway, since failing to do so would either
-			 * result in an endless wait or an Assert() failure.
-			 */
-			if (TransactionIdIsCurrentTransactionId(xid))
-				elog(ERROR, "waiting for ourselves");
+		for (off = 0; off < builder->committed.xcnt; off++)
+			SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
 
-			XactLockTableWait(xid, NULL, NULL, XLTW_None);
-		}
+		/* We might have reached consistent point now. */
+		if (builder->state == SNAPBUILD_CONSISTENT)
+			return false;
 
-		/* nothing could have built up so far, so don't perform cleanup */
-		return false;
+		ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
 	}
 
 	/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ebf6a92..b3d6829 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2060,12 +2060,13 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
 
+	/* We don't release XidGenLock here, the caller is responsible for that */
+	LWLockRelease(ProcArrayLock);
+
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
 	Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));
 
-	/* We don't release the locks here, the caller is responsible for that */
-
 	return CurrentRunningXacts;
 }
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f93..ddb279e 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -929,27 +929,8 @@ LogStandbySnapshot(void)
 	 */
 	running = GetRunningTransactionData();
 
-	/*
-	 * GetRunningTransactionData() acquired ProcArrayLock, we must release it.
-	 * For Hot Standby this can be done before inserting the WAL record
-	 * because ProcArrayApplyRecoveryInfo() rechecks the commit status using
-	 * the clog. For logical decoding, though, the lock can't be released
-	 * early because the clog might be "in the future" from the POV of the
-	 * historic snapshot. This would allow for situations where we're waiting
-	 * for the end of a transaction listed in the xl_running_xacts record
-	 * which, according to the WAL, has committed before the xl_running_xacts
-	 * record. Fortunately this routine isn't executed frequently, and it's
-	 * only a shared lock.
-	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.7.4

0005-Skip-unnecessary-snapshot-builds.patchtext/plain; charset=UTF-8; name=0005-Skip-unnecessary-snapshot-builds.patchDownload
From ff27b1fc7099fa668a9e28daa28bfee1ad9410bd Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 5/5] Skip unnecessary snapshot builds

When doing the initial snapshot build during logical decoding
initialization, don't build snapshots for transactions that we know
didn't do any catalog changes. Otherwise we might end up with thousands
of useless snapshots on a busy server, which can be quite expensive.
---
 src/backend/replication/logical/snapbuild.c | 100 ++++++++++++++++------------
 1 file changed, 58 insertions(+), 42 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d989576..916b297 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -975,9 +975,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 {
 	int			nxact;
 
-	bool		forced_timetravel = false;
-	bool		sub_needs_timetravel = false;
-	bool		top_needs_timetravel = false;
+	bool		need_timetravel = false;
+	bool		need_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -997,12 +996,22 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			builder->start_decoding_at = lsn + 1;
 
 		/*
-		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
-		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * When building full snapshot we need to keep track of all
+		 * transactions.
 		 */
-		forced_timetravel = true;
-		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+		if (builder->building_full_snapshot)
+		{
+			need_timetravel = true;
+			elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+		}
+
+		/*
+		 * If we could not observe the just finished transaction since it
+		 * started (because it started before we started tracking), we'll
+		 * always need a snapshot.
+		 */
+		if (SnapBuildTxnIsRunning(builder, xid))
+			need_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -1015,23 +1024,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
-			sub_needs_timetravel = true;
+			need_timetravel = true;
+			need_snapshot = true;
 
 			elog(DEBUG1, "found subtransaction %u:%u with catalog changes.",
 				 xid, subxid);
@@ -1041,6 +1040,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we have already decided that timetravel is needed for this
+		 * transaction, we also need visibility information about
+		 * subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (need_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1049,29 +1059,27 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
-	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	/*
+	 * Add toplevel transaction to base snapshot if it made any catalog
+	 * changes...
+	 */
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
 
-		top_needs_timetravel = true;
+		need_timetravel = true;
+		need_snapshot = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
-	else if (sub_needs_timetravel)
+	/* ... or if previous checks decided we need timetravel anyway. */
+	else if (need_timetravel)
 	{
-		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
-	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
+	if (need_timetravel)
 	{
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
@@ -1092,15 +1100,25 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
 			return;
 
+		/* We always need to build snapshot if there isn't one yet. */
+		need_snapshot = need_snapshot || !builder->snapshot;
+
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
 		 */
-		if (builder->snapshot)
+		if (builder->snapshot && need_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (need_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1110,11 +1128,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (need_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#33Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#28)
Re: snapbuild woes

On Thu, Apr 13, 2017 at 12:58:12AM -0400, Noah Misch wrote:

On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:

On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:

On 4/12/17 02:31, Noah Misch wrote:

But I hope you mean to commit these snapbuild patches before the postgres 10
release? As far as I know, logical replication is still very broken without
them (or at least some of that set of 5 patches - I don't know which ones
are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

[Action required within three days. This is a generic notification.]

I'm hoping for a word from Andres on this.

Feel free to reassign to me.

Thanks for volunteering; I'll do that shortly. Please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#34Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#33)
Re: snapbuild woes

On 2017-04-16 22:04:04 -0400, Noah Misch wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend. I'll do another round tomorrow, and will see
how it looks. I think we might need some more tests for this to be
committable, so it might not become committable tomorrow. I hope we'll
have something in tree by end of this week, if not I'll send an update.

- Andres


#35Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#34)
Re: snapbuild woes

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I think we might need some more tests for this to be committable, so
it might not become committable tomorrow.

I'm working on some infrastructure around this. Not sure if it needs to
be committed, but it's certainly useful for evaluation. Basically it's
a small UDF that:
1) creates a slot via walsender protocol (to some dsn)
2) imports that snapshot into yet another connection to that dsn
3) runs some query over that new connection

That makes it reasonably easy to run e.g. pgbench and continually create
slots, and use the snapshot to run queries "verifying" that things look
good. It's a bit shoestring-ed together, but everything else seems to
require more code. And it's just a test.

Unless somebody has a better idea?

- Andres


#36Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#35)
Re: snapbuild woes

On 20/04/17 02:09, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I think we might need some more tests for this to be committable, so
it might not become committable tomorrow.

I'm working on some infrastructure around this. Not sure if it needs to
be committed, but it's certainly useful for evaluation. Basically it's
a small UDF that:
1) creates a slot via walsender protocol (to some dsn)
2) imports that snapshot into yet another connection to that dsn
3) runs some query over that new connection

That makes it reasonably easy to run e.g. pgbench and continually create
slots, and use the snapshot to run queries "verifying" that things look
good. It's a bit shoestring-ed together, but everything else seems to
require more code. And it's just a test.

Unless somebody has a better idea?

I don't. I mean, it would be nice to have the isolation tester support
the walsender protocol, but I don't know anything about the isolation
tester internals, so I have no idea how much work that is. On top of
that, some of the issues are not even possible to provoke via the
isolation tester (or anything similar that would give us control over
timing) unless we also expose a lot of the guts of xlog/xact as UDFs. So
I think a simple function that does what you said, plus pgbench, is a
reasonable solution. I guess you plan to make that one of the
test/modules or something similar (the UDF)?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#37Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#36)
Re: snapbuild woes

On 2017-04-20 13:32:10 +0200, Petr Jelinek wrote:

On 20/04/17 02:09, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
I'm working on some infrastructure around this. Not sure if it needs to
be committed, but it's certainly useful for evaluation. Basically it's
a small UDF that:
1) creates a slot via walsender protocol (to some dsn)
2) imports that snapshot into yet another connection to that dsn
3) runs some query over that new connection

That makes it reasonably easy to run e.g. pgbench and continually create
slots, and use the snapshot to run queries "verifying" that things look
good. It's a bit shoestring-ed together, but everything else seems to
require more code. And it's just test.

Unless somebody has a better idea?

I don't. I mean, it would be nice to have the isolation tester support
the walsender protocol, but I don't know anything about the isolation
tester internals, so I have no idea how much work that is.

Now that the replication protocol supports normal queries, it's actually
not much of an issue on its own. The problem is more that
isolationtester's client-side language isn't powerful enough - you
can't extract the snapshot name from one session and import it in
another. While that might be something we want to address, I certainly
don't want to tackle it for v10.

I'd started to develop a C toolkit as above, but after I got the basics
running I actually noticed it's pretty much unnecessary: You can just as
well do it with dblink and some plpgsql.
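For the archives, a rough sketch of what that can look like (untested here;
the slot name, table, and connection strings are placeholders, and it
assumes dblink plus a walsender connection opened with replication=database
so the CREATE_REPLICATION_SLOT result can be read like an ordinary result
set):

  DO $$
  DECLARE
    snapname text;
    cnt      bigint;
  BEGIN
    -- walsender connection; replication commands go over the simple query protocol
    PERFORM dblink_connect('repl', 'dbname=postgres replication=database');

    SELECT snapshot_name INTO snapname
      FROM dblink('repl',
                  'CREATE_REPLICATION_SLOT tmp_slot LOGICAL test_decoding EXPORT_SNAPSHOT')
           AS r(slot_name text, consistent_point text, snapshot_name text, output_plugin text);

    -- import the snapshot on a second, normal connection while 'repl' stays
    -- idle, since the snapshot is only exported until the walsender's next command
    PERFORM dblink_connect('check', 'dbname=postgres');
    PERFORM dblink_exec('check', 'BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ');
    PERFORM dblink_exec('check', format('SET TRANSACTION SNAPSHOT %L', snapname));

    -- run whatever verification query makes sense against the imported snapshot
    SELECT n INTO cnt
      FROM dblink('check', 'SELECT count(*) FROM pgbench_accounts') AS c(n bigint);
    RAISE NOTICE 'rows visible through exported snapshot: %', cnt;

    PERFORM dblink_exec('check', 'COMMIT');
    PERFORM dblink_disconnect('check');
    PERFORM dblink_exec('repl', 'DROP_REPLICATION_SLOT tmp_slot');
    PERFORM dblink_disconnect('repl');
  END;
  $$;

Wrapped in a function that generates a unique slot name per call, something
of that shape can then be hammered from pgbench alongside a regular write
workload.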

I can reliably reproduce several of the bugs in this thread in a
relatively short amount of time before applying the patch, and so far
not after. That's great!

I guess you plan to make that one of the test/modules or something
similar (the UDF)?

I have a bunch of tests, but I don't quite know whether we can expose all
of them via classical tests. There are several easy ones that I
definitely want to add (import an "empty" snapshot; import a snapshot with
running xacts; create a snapshot, perform some DDL, import the snapshot,
perform some more DDL, check that even reasonably crazy combinations still
work), but there are enough others that are just probabilistic. I was
wondering about adding a loop that simply runs for something like 30s and
then quits, but who knows.

Testing around this made me wonder whether we need to make bgwriter.c's
LOG_SNAPSHOT_INTERVAL_MS configurable - for efficient testing reducing
it is quite valuable, and on busier machines it'll also almost always be
a win to log more frequently... Opinions?

Greetings,

Andres Freund


#38Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#34)
Re: snapbuild woes

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend. I'll do another round tomorrow, and will see
how it looks. I think we might need some more tests for this to be
committable, so it might not become committable tomorrow. I hope we'll
have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite
was more work than I'd anticipated, so I'm slightly lagging behind. I
hope to have a patchset tomorrow, aiming to commit something
Monday/Tuesday.

- Andres


#39Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#32)
1 attachment(s)
Re: snapbuild woes

Hi,

On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:

Hi, here is updated patch (details inline).

I'm not yet all that happy, sorry:

Looking at 0001:
- GetOldestSafeDecodingTransactionId() only guarantees to return an xid
safe for decoding (note how procArray->replication_slot_catalog_xmin
is checked), not one for the initial snapshot - so afaics this whole
exercise doesn't guarantee much so far.
- A later commit introduces need_full_snapshot as a
CreateInitDecodingContext() parameter, but you don't use it, at least not
here. That seems wrong.
- I remain unhappy with the handling of the reset of effective_xmin in
FreeDecodingContext(). What if we ERROR/FATAL out before that happens?

What do you think about something like the attached? I've not yet
tested it in any way except running the regression tests.

- Andres

Attachments:

0001-Preserve-required-catalog-tuples-while-computing-ini.patchtext/x-patch; charset=us-asciiDownload
From b20c8e1edb31d517ecb714467a7acbeec1b926dc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Apr 2017 20:41:29 -0700
Subject: [PATCH] Preserve required !catalog tuples while computing initial
 decoding snapshot.

The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.

This could cause a corrupted initial snapshot to be exported.  The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to hit it.  Ongoing decoding is not
affected by this bug.

To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol.  To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.

In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.

Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
---
 src/backend/replication/logical/logical.c   | 25 +++++++++++++++++--------
 src/backend/replication/logical/snapbuild.c | 12 ++++++++++++
 src/backend/replication/slot.c              | 25 +++++++++++++++++++++----
 src/backend/replication/slotfuncs.c         |  4 ++--
 src/backend/replication/walsender.c         |  1 +
 src/backend/storage/ipc/procarray.c         | 14 +++++++++++---
 src/include/replication/logical.h           |  1 +
 src/include/storage/procarray.h             |  2 +-
 8 files changed, 66 insertions(+), 18 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8fb4..032e91c371 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -210,6 +210,7 @@ StartupDecodingContext(List *output_plugin_options,
 LogicalDecodingContext *
 CreateInitDecodingContext(char *plugin,
 						  List *output_plugin_options,
+						  bool need_full_snapshot,
 						  XLogPageReadCB read_page,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write)
@@ -267,23 +268,31 @@ CreateInitDecodingContext(char *plugin,
 	 * the slot machinery about the new limit. Once that's done the
 	 * ProcArrayLock can be released as the slot machinery now is
 	 * protecting against vacuum.
+	 *
+	 * Note that, temporarily, the data, not just the catalog, xmin has to be
+	 * reserved if a data snapshot is to be exported.  Otherwise the initial
+	 * data snapshot created here is not guaranteed to be valid. After that
+	 * the data xmin doesn't need to be managed anymore and the global xmin
+	 * should be recomputed. As we are fine with losing the pegged data xmin
+	 * after crash - no chance a snapshot would get exported anymore - we can
+	 * get away with just setting the slot's
+	 * effective_xmin. ReplicationSlotRelease will reset it again.
+	 *
 	 * ----
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
-	slot->data.catalog_xmin = slot->effective_catalog_xmin;
+	xmin_horizon = GetOldestSafeDecodingTransactionId(need_full_snapshot);
+
+	slot->effective_catalog_xmin = xmin_horizon;
+	slot->data.catalog_xmin = xmin_horizon;
+	if (need_full_snapshot)
+		slot->effective_xmin = xmin_horizon;
 
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
 
-	/*
-	 * tell the snapshot builder to only assemble snapshot once reaching the
-	 * running_xact's record with the respective xmin.
-	 */
-	xmin_horizon = slot->data.catalog_xmin;
-
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 358ec28932..458a52b68b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -606,6 +606,18 @@ SnapBuildExportSnapshot(SnapBuild *builder)
 
 	snap = SnapBuildInitialSnapshot(builder);
 
+#ifdef USE_ASSERT_CHECKING
+	{
+		TransactionId safeXid;
+
+		LWLockAcquire(ProcArrayLock, LW_SHARED);
+		safeXid = GetOldestSafeDecodingTransactionId(true);
+		LWLockRelease(ProcArrayLock);
+
+		Assert(TransactionIdPrecedesOrEquals(safeXid, builder->xmin));
+	}
+#endif
+
 	/*
 	 * now that we've built a plain snapshot, make it active and use the
 	 * normal mechanisms for exporting it
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8ad0f7b39..5f63d0484a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -398,6 +398,22 @@ ReplicationSlotRelease(void)
 		SpinLockRelease(&slot->mutex);
 	}
 
+
+	/*
+	 * If slot needed to temporarily restrain both data and catalog xmin to
+	 * create the catalog snapshot, remove that temporary constraint.
+	 * Snapshots can only be exported while the initial snapshot is still
+	 * acquired.
+	 */
+	if (!TransactionIdIsValid(slot->data.xmin) &&
+		TransactionIdIsValid(slot->effective_xmin))
+	{
+		SpinLockAcquire(&slot->mutex);
+		slot->effective_xmin = InvalidTransactionId;
+		SpinLockRelease(&slot->mutex);
+		ReplicationSlotsComputeRequiredXmin(false);
+	}
+
 	MyReplicationSlot = NULL;
 
 	/* might not have been set when we've been a plain slot */
@@ -612,6 +628,9 @@ ReplicationSlotPersist(void)
 
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
+ *
+ * If already_locked is true, ProcArrayLock has already been acquired
+ * exclusively.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -622,8 +641,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
-	if (!already_locked)
-		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -652,8 +670,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	if (!already_locked)
-		LWLockRelease(ReplicationSlotControlLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
 }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 7104c94795..6ee1e68819 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -131,8 +131,8 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
 	 */
-	ctx = CreateInitDecodingContext(
-									NameStr(*plugin), NIL,
+	ctx = CreateInitDecodingContext(NameStr(*plugin), NIL,
+									false, /* do not build snapshot */
 									logical_read_local_xlog_page, NULL, NULL);
 
 	/* build initial snapshot, might take a while */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 064cf5ee28..43c8a73f3e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -909,6 +909,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		}
 
 		ctx = CreateInitDecodingContext(cmd->plugin, NIL,
+										true, /* build snapshot */
 										logical_read_xlog_page,
 										WalSndPrepareWrite, WalSndWriteData);
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ebf6a92923..233eb606f5 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2151,7 +2151,7 @@ GetOldestActiveTransactionId(void)
  * that the caller will immediately use the xid to peg the xmin horizon.
  */
 TransactionId
-GetOldestSafeDecodingTransactionId(void)
+GetOldestSafeDecodingTransactionId(bool catalogOnly)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId oldestSafeXid;
@@ -2174,9 +2174,17 @@ GetOldestSafeDecodingTransactionId(void)
 	/*
 	 * If there's already a slot pegging the xmin horizon, we can start with
 	 * that value, it's guaranteed to be safe since it's computed by this
-	 * routine initially and has been enforced since.
+	 * routine initially and has been enforced since.  We can always use the
+	 * slot's general xmin horizon, but the catalog horizon is only usable
+	 * when only catalog data is going to be looked at.
 	 */
-	if (TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
+	if (TransactionIdIsValid(procArray->replication_slot_xmin) &&
+		TransactionIdPrecedes(procArray->replication_slot_xmin,
+							  oldestSafeXid))
+		oldestSafeXid = procArray->replication_slot_xmin;
+
+	if (catalogOnly &&
+		TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
 		TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
 							  oldestSafeXid))
 		oldestSafeXid = procArray->replication_slot_catalog_xmin;
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88efe3..80f04c3cb9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,7 @@ extern void CheckLogicalDecodingRequirements(void);
 
 extern LogicalDecodingContext *CreateInitDecodingContext(char *plugin,
 						  List *output_plugin_options,
+						  bool need_full_snapshot,
 						  XLogPageReadCB read_page,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9b42e49524..805ecd25ec 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -88,7 +88,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
-extern TransactionId GetOldestSafeDecodingTransactionId(void);
+extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 
 extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
 extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
-- 
2.12.0.264.gd6db3f2165.dirty

#40Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#39)
Re: snapbuild woes

On 25/04/17 00:59, Andres Freund wrote:

Hi,

On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:

Hi, here is updated patch (details inline).

I'm not yet all that happy, sorry:

Looking at 0001:
- GetOldestSafeDecodingTransactionId() only guarantees to return an xid
safe for decoding (note how procArray->replication_slot_catalog_xmin
is checked), not one for the initial snapshot - so afaics this whole
exercise doesn't guarantee much so far.
- A later commit introduces need_full_snapshot as a
CreateInitDecodingContext parameter, but you don't use it, not here. That
seems wrong.

Ah yeah looks like that optimization is useful even here.

- I remain unhappy with the handling of the reset of effective_xmin in
FreeDecodingContext(). What if we ERROR/FATAL out before that happens?

Oh, your problem was that I did it in FreeDecodingContext() instead of
at slot release; that part I didn't get. Yeah, sure, that's possibly a
better place.

What do you think about something like the attached? I've not yet
tested it any way except running the regression tests.

-	if (!already_locked)
-		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

Don't really understand this change much, but otherwise the patch looks
good to me.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#41Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#38)
Re: snapbuild woes

On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend. I'll do another round tomorrow, and will see
how it looks. I think we might need some more tests for this to be
committable, so it might not become committable tomorrow. I hope we'll
have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite
was more work than I'd anticipated, so I'm slightly lagging behind. I
hope to have a patchset tomorrow, aiming to commit something
Monday/Tuesday.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#42Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#41)
Re: snapbuild woes

On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend. I'll do another round tomorrow, and will see
how it looks. I think we might need some more tests for this to be
committable, so it might not become committable tomorrow. I hope we'll
have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite
was more work than I'd anticipated, so I'm slightly lagging behind. I
hope to have a patchset tomorrow, aiming to commit something
Monday/Tuesday.

This PostgreSQL 10 open item is past due for your status update. Kindly
send a status update within 24 hours, and include a date for your
subsequent status update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review & testing, this is easy to get wrong...
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#43Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#37)
Re: snapbuild woes

On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:

I've a bunch of tests, but I don't quite know whether we can expose all
of them via classical tests. There are several easy ones that I
definitely want to add (import "empty" snapshot; import snapshot with
running xacts; create snapshot, perform some ddl, import snapshot,
perform some ddl, check things work reasonably crazy), but there's
enough others that are just probabilistic. I was wondering about adding
a loop that simply runs for like 30s and then quits or such, but who
knows.

If the probabilistic test catches the bug even 5% of the time in typical
configurations, the buildfarm will rapidly identify any regression. I'd
choose a 7s test that detects the bug 5% of the time over a 30s test that
detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl
for a probabilistic bug, I sized that test to finish in 1s and catch its bug
half the time. In its case, only two buildfarm members were able to
demonstrate the original bug, so 5% detection would have been too low.)


#44Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#32)
Re: snapbuild woes

Hi,

On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:

+	/*
+	 * c) we have already seen the xl_running_xacts and tried to do the
+	 * above. However, because of a race condition in LogStandbySnapshot(),
+	 * a transaction might have been reported as running even though it had
+	 * already written its commit record before the xl_running_xacts, so
+	 * decoding has missed it. We now see an xl_running_xacts that suggests
+	 * all transactions from the original one were closed, but the
+	 * consistent state wasn't reached, which means the race condition has
+	 * indeed happened.
+	 *
+	 * Start tracking again as if this was the first xl_running_xacts we've
+	 * seen, with the advantage that, because decoding was already running,
+	 * any transactions committed before the xl_running_xacts record will
+	 * be known to us, so we won't hit the same issue again.
+	 */

Unfortunately I don't think that's true, as coded. You're using
information about committed transactions:

+	else if (TransactionIdFollows(running->oldestRunningXid,
+								  builder->running.xmax))
+	{
+		int off;
+
+		SnapBuildStartXactTracking(builder, running);
+
/*
+		 * Mark any transactions that are known to have committed before the
+		 * xl_running_xacts as finished to avoid the race condition in
+		 * LogStandbySnapshot().
*
+		 * We can use SnapBuildEndTxn directly as it only does the
+		 * transaction running check and handling without any additional
+		 * side effects.
*/
+		for (off = 0; off < builder->committed.xcnt; off++)
+			SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);

but a transaction might just have *aborted* before the new snapshot, no?
Since we don't keep track of those, I don't think this guarantees anything?

ISTM, we need a xip_status array in SnapBuild->running. Then, whenever
a xl_running_xacts is encountered while builder->running.xcnt > 0, loop
for i in 0 .. xcnt_space, and call SnapBuildEndTxn() if xip_status[i] is
set but the xid is not in xl_running_xacts? Does that make sense?

- Andres


#45Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#44)
1 attachment(s)
Re: snapbuild woes

On 2017-04-30 17:59:21 -0700, Andres Freund wrote:

ISTM, we need a xip_status array in SnapBuild->running. Then, whenever
a xl_running_xacts is encountered while builder->running.xcnt > 0, loop
for i in 0 .. xcnt_space, and call SnapBuildEndTxn() if xip_status[i] is
set but the xid is not in xl_running_xacts? Does that make sense?

A hasty implementation, untested besides check-world, of that approach
is attached. I'm going out for dinner now, will subject it to mean
things afterwards.

Needs more testing, comment and commit message policing. But I think
the idea is sound?

- Andres

Attachments:

0001-Fix-initial-logical-decoding-snapshat-race-condition.patch (text/x-patch; charset=us-ascii)
From 6160eda5177e7f538ff13eacc46acb2f2959c257 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 30 Apr 2017 18:24:06 -0700
Subject: [PATCH] Fix initial logical decoding snapshat race condition.

LogStandbySnapshot() has no interlock preventing
RecordTransactionCommit() from logging its commit record (abort
situation is similar), before the xl_running_xacts record is
generated.  That can lead to the situation that logical decoding
forever waits for a commit/abort record, that was logged in the past.

To fix, check for such "missed" transactions when xl_running_xacts are
observed, while in SNAPBUILD_FULL_SNAPSHOT.

This also reverts changes made to GetRunningTransactionData() and
LogStandbySnapshot() by b89e151 as the additional locking does not
solve the problem, and is not required anymore.

Author: Petr Jelinek and Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4, where logical decoding was introduced
---
 src/backend/replication/logical/snapbuild.c | 98 +++++++++++++++++++++++------
 src/backend/storage/ipc/procarray.c         |  5 +-
 src/backend/storage/ipc/standby.c           | 19 ------
 3 files changed, 81 insertions(+), 41 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 068d214fa1..8605625555 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -203,6 +203,7 @@ struct SnapBuild
 		size_t		xcnt;		/* number of used xip entries */
 		size_t		xcnt_space; /* allocated size of xip */
 		TransactionId *xip;		/* running xacts array, xidComparator-sorted */
+		bool	   *xip_running; /* xid in ->xip still running? */
 	}			running;
 
 	/*
@@ -253,7 +254,7 @@ static bool ExportInProgress = false;
 static void SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* ->running manipulation */
-static bool SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid);
+static bool SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid, int *off);
 
 /* ->committed manipulation */
 static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
@@ -700,7 +701,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
 	 * we got into the SNAPBUILD_FULL_SNAPSHOT state.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT &&
-		SnapBuildTxnIsRunning(builder, xid))
+		SnapBuildTxnIsRunning(builder, xid, NULL))
 		return false;
 
 	/*
@@ -776,7 +777,7 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
  * only exist after we freshly started from an < CONSISTENT snapshot.
  */
 static bool
-SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid)
+SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid, int *off)
 {
 	Assert(builder->state < SNAPBUILD_CONSISTENT);
 	Assert(TransactionIdIsNormal(builder->running.xmin));
@@ -793,8 +794,17 @@ SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid)
 		if (search != NULL)
 		{
 			Assert(*search == xid);
+
+			if (off)
+			{
+				*off = search - builder->running.xip;
+				Assert(builder->running.xip[*off] == xid);
+			}
+
 			return true;
 		}
+
+
 	}
 
 	return false;
@@ -928,6 +938,8 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 static void
 SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
+	int off;
+
 	if (builder->state == SNAPBUILD_CONSISTENT)
 		return;
 
@@ -938,10 +950,13 @@ SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 	 * subxids and since they end at the same time it's sufficient to deal
 	 * with them here.
 	 */
-	if (SnapBuildTxnIsRunning(builder, xid))
+	if (SnapBuildTxnIsRunning(builder, xid, &off))
 	{
 		Assert(builder->running.xcnt > 0);
 
+		Assert(builder->running.xip_running[off]);
+		builder->running.xip_running[off] = false;
+
 		if (!--builder->running.xcnt)
 		{
 			/*
@@ -1250,9 +1265,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) or c).
+	 *	  state while waiting for b) or c).
 	 *
-	 * b) Wait for all toplevel transactions that were running to end. We
+	 * b) This (in a previous run) or another decoding slot serialized a
+	 *	  snapshot to disk that we can use.  Can't use this method for the
+	 *	  initial snapshot when slot is being created and needs full snapshot
+	 *	  for export or direct use, as that snapshot will only contain catalog
+	 *	  modifying transactions.
+	 *
+	 * c) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
@@ -1264,11 +1285,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
 	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.  Can't use this method for the
-	 *	  initial snapshot when slot is being created and needs full snapshot
-	 *	  for export or direct use, as that snapshot will only contain catalog
-	 *	  modifying transactions.
+	 *    Unfortunately there's a race condition around LogStandbySnapshot(),
+	 *    where transactions might have logged their commit record, before
+	 *    xl_running_xacts itself is logged. In that case the decoding logic
+	 *    would have missed that fact.  Thus
+	 *
+	 * d) xl_running_xacts shows us that transaction(s) assumed to be still
+	 *    running have actually already finished.  Adjust their status.
 	 * ---
 	 */
 
@@ -1323,16 +1346,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state and not building full snapshot */
+	/* b) valid on disk state and not building full snapshot */
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
 		/* there won't be any state to cleanup */
 		return false;
 	}
-
 	/*
-	 * b) first encounter of a useable xl_running_xacts record. If we had
+	 * c) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
 	 * get called).
@@ -1367,6 +1389,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 							   builder->running.xcnt * sizeof(TransactionId));
 		memcpy(builder->running.xip, running->xids,
 			   builder->running.xcnt * sizeof(TransactionId));
+		builder->running.xip_running =
+			MemoryContextAlloc(builder->context,
+							   builder->running.xcnt * sizeof(bool));
+		memset(builder->running.xip_running, 1,
+			   builder->running.xcnt * sizeof(bool));
 
 		/* sort so we can do a binary search */
 		qsort(builder->running.xip, builder->running.xcnt,
@@ -1414,13 +1441,44 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		/* nothing could have built up so far, so don't perform cleanup */
 		return false;
 	}
+	/* d) already tracking running xids, check whether any were missed */
+	else
+	{
+		size_t xcnt = running->xcnt;
+		TransactionId *xip =
+			MemoryContextAlloc(builder->context, xcnt * sizeof(TransactionId));
+		int i;
 
-	/*
-	 * We already started to track running xacts and need to wait for all
-	 * in-progress ones to finish. We fall through to the normal processing of
-	 * records so incremental cleanup can be performed.
-	 */
-	return true;
+		memcpy(xip, running->xids, xcnt * sizeof(TransactionId));
+
+		/* sort so we can do a binary search */
+		qsort(xip, xcnt, sizeof(TransactionId), xidComparator);
+
+		/*
+		 * Mark all transactions as finished that we assumed were running, but
+		 * actually aren't according to the xl_running_xacts record.
+		 */
+		for (i = 0; i < builder->running.xcnt_space; i++)
+		{
+			TransactionId still_running = builder->running.xip[i];
+			void *test;
+
+			if (!builder->running.xip_running[i])
+				continue;
+
+			test = bsearch(&still_running, xip, xcnt,
+						   sizeof(TransactionId), xidComparator);
+			if (!test)
+				SnapBuildEndTxn(builder, lsn, still_running);
+		}
+
+		/*
+		 * We already started to track running xacts and need to wait for all
+		 * in-progress ones to finish. We fall through to the normal processing of
+		 * records so incremental cleanup can be performed.
+		 */
+		return true;
+	}
 }
 
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8a71536791..de3ae92dd7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2060,12 +2060,13 @@ GetRunningTransactionData(void)
 	CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
 	CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
 
+	/* We don't release XidGenLock here, the caller is responsible for that */
+	LWLockRelease(ProcArrayLock);
+
 	Assert(TransactionIdIsValid(CurrentRunningXacts->nextXid));
 	Assert(TransactionIdIsValid(CurrentRunningXacts->oldestRunningXid));
 	Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));
 
-	/* We don't release the locks here, the caller is responsible for that */
-
 	return CurrentRunningXacts;
 }
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 8e57f933ca..ddb279e274 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -929,27 +929,8 @@ LogStandbySnapshot(void)
 	 */
 	running = GetRunningTransactionData();
 
-	/*
-	 * GetRunningTransactionData() acquired ProcArrayLock, we must release it.
-	 * For Hot Standby this can be done before inserting the WAL record
-	 * because ProcArrayApplyRecoveryInfo() rechecks the commit status using
-	 * the clog. For logical decoding, though, the lock can't be released
-	 * early because the clog might be "in the future" from the POV of the
-	 * historic snapshot. This would allow for situations where we're waiting
-	 * for the end of a transaction listed in the xl_running_xacts record
-	 * which, according to the WAL, has committed before the xl_running_xacts
-	 * record. Fortunately this routine isn't executed frequently, and it's
-	 * only a shared lock.
-	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.12.0.264.gd6db3f2165.dirty

#46Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#45)
Re: snapbuild woes

On 01/05/17 03:35, Andres Freund wrote:

On 2017-04-30 17:59:21 -0700, Andres Freund wrote:

ISTM, we need a xip_status array in SnapBuild->running. Then, whenever
a xl_running_xacts is encountered while builder->running.xcnt > 0, loop
for i in 0 .. xcnt_space, and call SnapBuildEndTxn() if xip_status[i] is
set but the xid is not in xl_running_xacts? Does that make sense?

A hasty implementation, untested besides check-world, of that approach
is attached. I'm going out for dinner now, will subject it to mean
things afterwards.

Needs more testing, comment and commit message policing. But I think
the idea is sound?

I agree with adding running, I think that's a good thing even for the per
transaction tracking and snapshot exports - we could use the newly added
field to get rid of the issue we have with 'snapshot too large' when
there were many aborted transactions while we waited for running ones to
finish. Because so far we only tracked committed transactions, any aborted
one would show as running inside the exported snapshot, so it can grow
over the maximum number of backends and can't be exported anymore. So +1
for that part.

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#47Craig Ringer
craig@2ndquadrant.com
In reply to: Petr Jelinek (#46)
Re: snapbuild woes

On 1 May 2017 at 09:54, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

Due to the race where LogStandbySnapshot() collects running-xacts info
while a concurrent xact commits, such that the xl_xact_commit appears
before the xl_running_xacts, but the xl_running_xacts still has the
committed xact listed as running, right? Because we update PGXACT only
after we write the commit to WAL, so there's a window where an xact is
committed in WAL but not shown as committed in shmem.
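
In generic terms the hazard is just that the commit-record write and the
shared-memory update are not atomic. A self-contained C illustration of
that ordering follows (plain pthreads, not PostgreSQL code; every name in
it is made up):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t shmem_lock = PTHREAD_MUTEX_INITIALIZER;
static int	xact_running_in_shmem = 1;	/* like PGXACT: only cleared after
										 * the commit record is "in WAL" */

static void
log_record(const char *rec)
{
	pthread_mutex_lock(&log_lock);
	printf("WAL: %s\n", rec);
	pthread_mutex_unlock(&log_lock);
}

static void *
committer(void *arg)
{
	(void) arg;
	log_record("commit of xid 1000");	/* commit record hits the log ... */
	usleep(100 * 1000);					/* ... but the shmem update lags */
	pthread_mutex_lock(&shmem_lock);
	xact_running_in_shmem = 0;
	pthread_mutex_unlock(&shmem_lock);
	return NULL;
}

static void *
snapshotter(void *arg)
{
	int			still_running;

	(void) arg;
	usleep(50 * 1000);					/* land inside that window */
	pthread_mutex_lock(&shmem_lock);
	still_running = xact_running_in_shmem;
	pthread_mutex_unlock(&shmem_lock);
	log_record(still_running
			   ? "running-xacts: xid 1000 still running"
			   : "running-xacts: nothing running");
	return NULL;
}

int
main(void)
{
	pthread_t	c,
				s;

	pthread_create(&c, NULL, committer, NULL);
	pthread_create(&s, NULL, snapshotter, NULL);
	pthread_join(c, NULL);
	pthread_join(s, NULL);
	return 0;
}

Built with "cc race.c -pthread", this prints the commit record first and
then a running-xacts line that still claims xid 1000 is running - which is
exactly the state logical decoding then has to cope with.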

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#48Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Craig Ringer (#47)
Re: snapbuild woes

On 01/05/17 04:29, Craig Ringer wrote:

On 1 May 2017 at 09:54, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

Due to the race where LogStandbySnapshot() collects running-xacts info
while a concurrent xact commits, such that the xl_xact_commit appears
before the xl_running_xacts, but the xl_running_xacts still has the
committed xact listed as running, right? Because we update PGXACT only
after we write the commit to WAL, so there's a window where an xact is
committed in WAL but not shown as committed in shmem.

Yes, that's what the patch at hand tries to fix, but Andres approached it
from too simplistic a standpoint, as we don't only care about the exported
snapshot but also about whatever catalog snapshots we made for the
transactions we already track, unless I am missing something.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#49Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#46)
Re: snapbuild woes

On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:

I agree with adding running, I think that's a good thing even for the per
transaction tracking and snapshot exports - we could use the newly added
field to get rid of the issue we have with 'snapshot too large' when
there were many aborted transactions while we waited for running ones to
finish.

I'm not sure of that - what I was proposing would only track this for
the ->running substructure. How'd that help?

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

I'm afraid you're right. But I think this is even more complicated: The
argument in your version that this can only happen once, seems to also
be holey: Just imagine a pg_usleep(3000 * 1000000) right before
ProcArrayEndTransaction() and enjoy the picture.

Wonder if we should just (re-)add a stage between SNAPBUILD_START and
SNAPBUILD_FULL_SNAPSHOT. Enter SNAPBUILD_BUILD_INITIAL_SNAPSHOT at the
first xl_running_xacts, wait for all transactions to end with my
approach, while populating SnapBuild->committed, only then start
collecting changes for transactions (i.e. return true from
SnapBuildProcessChange()), return true once all xacts have finished
again. That'd presumably be a bit easier to understand, more robust -
and slower.
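
As a sketch, the state machine would gain one phase (the name is the one
proposed above; nothing here is committed code):

typedef enum
{
	SNAPBUILD_START,
	SNAPBUILD_BUILD_INITIAL_SNAPSHOT,	/* new: wait out the first set of
										 * running xacts, only populating
										 * ->committed, collecting no changes */
	SNAPBUILD_FULL_SNAPSHOT,			/* start collecting changes */
	SNAPBUILD_CONSISTENT				/* all xacts finished once more */
} SnapBuildState;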

- Andres


#50Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#49)
Re: snapbuild woes

On 01/05/17 10:03, Andres Freund wrote:

On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:

I agree with adding running, I think that's a good thing even for the per
transaction tracking and snapshot exports - we could use the newly added
field to get rid of the issue we have with 'snapshot too large' when
there were many aborted transactions while we waited for running ones to
finish.

I'm not sure of that - what I was proposing would only track this for
the ->running substructure. How'd that help?

Well not as is, but it's a building block for it.

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

I'm afraid you're right. But I think this is even more complicated: The
argument in your version that this can only happen once, seems to also
be holey: Just imagine a pg_usleep(3000 * 1000000) right before
ProcArrayEndTransaction() and enjoy the picture.

Well yes, a transaction can in theory have written its commit/abort xlog
record and stayed in the proc array for more than a single
xl_running_xacts write. But then the condition we test, that the new
xl_running_xacts has a bigger xmin than the previously tracked one's
xmax, would not be satisfied, and we would not enter the relevant code
path yet. So I think we should not be able to get any xids we didn't
see. But we have to restart tracking from the beginning (after first
checking whether we already saw anything that the xl_running_xacts
considers as running); that's what my code did.

Wonder if we should just (re-)add a stage between SNAPBUILD_START and
SNAPBUILD_FULL_SNAPSHOT. Enter SNAPBUILD_BUILD_INITIAL_SNAPSHOT at the
first xl_running_xacts, wait for all transactions to end with my
approach, while populating SnapBuild->committed, only then start
collecting changes for transactions (i.e. return true from
SnapBuildProcessChange()), return true once all xacts have finished
again. That'd presumably be a bit easier to understand, more robust -
and slower.

That would also work, but per above, I don't understand why it's needed.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#43)
Re: snapbuild woes

Noah Misch <noah@leadboat.com> writes:

On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:

... I was wondering about adding
a loop that simply runs for like 30s and then quits or such, but who
knows.

If the probabilistic test catches the bug even 5% of the time in typical
configurations, the buildfarm will rapidly identify any regression. I'd
choose a 7s test that detects the bug 5% of the time over a 30s test that
detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl
for a probabilistic bug, I sized that test to finish in 1s and catch its bug
half the time. In its case, only two buildfarm members were able to
demonstrate the original bug, so 5% detection would have been too low.)

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation". I'm fine
with individual tests that run for ~ 1sec.

(This is top-of-mind for me right now because I've been looking around
for ways to speed up the regression tests.)

regards, tom lane


#52Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Tom Lane (#51)
Re: snapbuild woes

On 05/01/2017 08:46 AM, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:

... I was wondering about adding
a loop that simply runs for like 30s and then quits or such, but who
knows.

If the probabilistic test catches the bug even 5% of the time in typical
configurations, the buildfarm will rapidly identify any regression. I'd
choose a 7s test that detects the bug 5% of the time over a 30s test that
detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl
for a probabilistic bug, I sized that test to finish in 1s and catch its bug
half the time. In its case, only two buildfarm members were able to
demonstrate the original bug, so 5% detection would have been too low.)

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation". I'm fine
with individual tests that run for ~ 1sec.

(This is top-of-mind for me right now because I've been looking around
for ways to speed up the regression tests.)

Yes, me too. We're getting a bit lazy about that - see thread nearby
that will let us avoid unnecessary temp installs among other things.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#53Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#51)
Re: snapbuild woes

On 2017-05-01 08:46:47 -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:

... I was wondering about adding
a loop that simply runs for like 30s and then quits or such, but who
knows.

If the probabilistic test catches the bug even 5% of the time in typical
configurations, the buildfarm will rapidly identify any regression. I'd
choose a 7s test that detects the bug 5% of the time over a 30s test that
detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl
for a probabilistic bug, I sized that test to finish in 1s and catch its bug
half the time. In its case, only two buildfarm members were able to
demonstrate the original bug, so 5% detection would have been too low.)

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation". I'm fine
with individual tests that run for ~ 1sec.

I was more thinking of pgbench -T$XX, rather than a constant number of
iterations. I currently can reproduce the issues within like 3-4
minutes, so 5s is probably not quite sufficient to get decent coverage.

- Andres


#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#53)
Re: snapbuild woes

Andres Freund <andres@anarazel.de> writes:

On 2017-05-01 08:46:47 -0400, Tom Lane wrote:

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation". I'm fine
with individual tests that run for ~ 1sec.

I was more thinking of pgbench -T$XX, rather than a constant number of
iterations. I currently can reproduce the issues within like 3-4
minutes, so 5s is probably not quite sufficient to get decent coverage.

Adding a five-minute pgbench run to the buildfarm sequence is definitely
going to get you ridden out of town on a rail. But quite aside from the
question of whether we can afford the cycles, it seems like the wrong
approach. IMO the buildfarm is mainly for verifying portability, not for
trying to prove that race-like conditions don't exist. In most situations
we're going out of our way to ensure reproducibility of tests we add to
the buildfarm sequence; but it seems like this is looking for
irreproducible results.

regards, tom lane


#55Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#54)
Re: snapbuild woes

On 2017-05-01 12:32:07 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2017-05-01 08:46:47 -0400, Tom Lane wrote:

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation". I'm fine
with individual tests that run for ~ 1sec.

I was more thinking of pgbench -T$XX, rather than a constant number of
iterations. I currently can reproduce the issues within like 3-4
minutes, so 5s is probably not quite sufficient to get decent coverage.

Adding a five-minute pgbench run to the buildfarm sequence is definitely
going to get you ridden out of town on a rail.

Right - that was referring to Noah's comment upthread:

On 2017-04-29 14:42:01 -0700, Noah Misch wrote:

If the probabilistic test catches the bug even 5% of the time in typical
configurations, the buildfarm will rapidly identify any regression. I'd
choose a 7s test that detects the bug 5% of the time over a 30s test that
detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl
for a probabilistic bug, I sized that test to finish in 1s and catch its bug
half the time. In its case, only two buildfarm members were able to
demonstrate the original bug, so 5% detection would have been too low.)

and I suspect that you'd not find these with a 5s test within a
reasonable amount of time, because the detection rate would be too low.

But quite aside from the question of whether we can afford the cycles,
it seems like the wrong approach. IMO the buildfarm is mainly for
verifying portability, not for trying to prove that race-like
conditions don't exist. In most situations we're going out of our way
to ensure reproducibility of tests we add to the buildfarm sequence;
but it seems like this is looking for irreproducible results.

Yea, I wondered about that upthread as well. But the tests are quite
useful nonetheless. Wonder about adding them simply as a separate
target.

- Andres


#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#55)
Re: snapbuild woes

Andres Freund <andres@anarazel.de> writes:

On 2017-05-01 12:32:07 -0400, Tom Lane wrote:

But quite aside from the question of whether we can afford the cycles,
it seems like the wrong approach. IMO the buildfarm is mainly for
verifying portability, not for trying to prove that race-like
conditions don't exist. In most situations we're going out of our way
to ensure reproducibility of tests we add to the buildfarm sequence;
but it seems like this is looking for irreproducible results.

Yea, I wondered about that upthread as well. But the tests are quite
useful nonetheless. Wonder about adding them simply as a separate
target.

I have no objection to adding more tests as a non-default target.

regards, tom lane


#57Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#50)
Re: snapbuild woes

On 2017-05-01 11:09:44 +0200, Petr Jelinek wrote:

On 01/05/17 10:03, Andres Freund wrote:

On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

I'm afraid you're right. But I think this is even more complicated: The
argument in your version that this can only happen once, seems to also
be holey: Just imagine a pg_usleep(3000 * 1000000) right before
ProcArrayEndTransaction() and enjoy the picture.

Well yes, a transaction can in theory have written its commit/abort xlog
record and stayed in the proc array for more than a single
xl_running_xacts write. But then the condition we test, that the new
xl_running_xacts has a bigger xmin than the previously tracked one's
xmax, would not be satisfied, and we would not enter the relevant code
path yet. So I think we should not be able to get any xids we didn't
see. But we have to restart tracking from the beginning (after first
checking whether we already saw anything that the xl_running_xacts
considers as running); that's what my code did.

But to get that correct, we'd have to not only track ->committed, but
also somehow maintain ->aborted, and not just for the transactions in
the original set of running transactions. That'd be fairly complicated
and large. The reason I was trying - and it's definitely not correct as
I had proposed - to use the original running_xacts record is that that
only required tracking as many transaction statuses as in the first
xl_running_xacts. Am I missing something?

The probabilistic tests catch the issues here fairly quickly, btw, if
you run with synchronous_commit=on, while pgbench is running, because
the WAL flushes make this more likely. Run this query:

SELECT account_count, teller_count, account_sum - teller_sum s
FROM
(
SELECT count(*) account_count, SUM(abalance) account_sum
FROM pgbench_accounts
) a,
(
SELECT count(*) teller_count, SUM(tbalance) teller_sum
FROM pgbench_tellers
) t

which, for my scale, should always return:
┌─────────┬─────┬───┐
│ a │ t │ s │
├─────────┼─────┼───┤
│ 2000000 │ 200 │ 0 │
└─────────┴─────┴───┘
but with my POC patch occasionally returns things like:
┌─────────┬─────┬───────┐
│ a │ t │ s │
├─────────┼─────┼───────┤
│ 2000000 │ 212 │ 37358 │
└─────────┴─────┴───────┘

which obviously shouldn't be the case.
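
To run that check against the snapshot exported during slot creation
rather than against the current state, it can be wired up roughly like
this (the slot and snapshot names below are made up; CREATE_REPLICATION_SLOT
has to be sent on a replication connection, e.g. psql "dbname=postgres
replication=database", and that connection has to stay open while the
snapshot is in use):

-- replication connection: the result set includes a snapshot_name
CREATE_REPLICATION_SLOT "test_slot" LOGICAL "test_decoding";

-- regular connection, while pgbench keeps running
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-00000002-1';  -- snapshot_name from above
SELECT account_count, teller_count, account_sum - teller_sum s
FROM
(
    SELECT count(*) account_count, SUM(abalance) account_sum
    FROM pgbench_accounts
) a,
(
    SELECT count(*) teller_count, SUM(tbalance) teller_sum
    FROM pgbench_tellers
) t;
COMMIT;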


#58Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tom Lane (#56)
Re: snapbuild woes

On 5/1/17 13:02, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2017-05-01 12:32:07 -0400, Tom Lane wrote:

But quite aside from the question of whether we can afford the cycles,
it seems like the wrong approach. IMO the buildfarm is mainly for
verifying portability, not for trying to prove that race-like
conditions don't exist. In most situations we're going out of our way
to ensure reproduceability of tests we add to the buildfarm sequence;
but it seems like this is looking for irreproducible results.

Yea, I wondered about that upthread as well. But the tests are quite
useful nonetheless. Wonder about adding them simply as a separate
target.

I have no objection to adding more tests as a non-default target.

Well, the problem with nondefault targets is that they are hard to find
if you don't know them, and then they will rot.

Sure, we need a way to distinguish different classes of tests, but let's
think about the bigger scheme, too. Ideas welcome.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#59Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#54)
Re: snapbuild woes

On Mon, May 01, 2017 at 12:32:07PM -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2017-05-01 08:46:47 -0400, Tom Lane wrote:

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation". I'm fine
with individual tests that run for ~ 1sec.

I was more thinking of pgench -T$XX, rather than constant number of
iterations. I currently can reproduce the issues within like 3-4
minutes, so 5s is probably not quite sufficient to get decent coverage.

You might hit the race faster by adding a dedicated stress test function to
regress.c.

IMO the buildfarm is mainly for verifying portability, not for
trying to prove that race-like conditions don't exist.

Perhaps so, but it has excelled at both tasks.


#60Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#57)
Re: snapbuild woes

On 01/05/17 21:14, Andres Freund wrote:

On 2017-05-01 11:09:44 +0200, Petr Jelinek wrote:

On 01/05/17 10:03, Andres Freund wrote:

On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

I'm afraid you're right. But I think this is even more complicated: The
argument in your version that this can only happen once, seems to also
be holey: Just imagine a pg_usleep(3000 * 1000000) right before
ProcArrayEndTransaction() and enjoy the picture.

Well yes, a transaction can in theory have written its commit/abort xlog
record and stayed in the proc array for more than a single
xl_running_xacts write. But then the condition we test, that the new
xl_running_xacts has a bigger xmin than the previously tracked one's
xmax, would not be satisfied, and we would not enter the relevant code
path yet. So I think we should not be able to get any xids we didn't
see. But we have to restart tracking from the beginning (after first
checking whether we already saw anything that the xl_running_xacts
considers as running); that's what my code did.

But to get that correct, we'd have to not only track ->committed, but
also somehow maintain ->aborted, and not just for the transactions in
the original set of running transactions. That'd be fairly complicated
and large. The reason I was trying - and it's definitely not correct as
I had proposed - to use the original running_xacts record is that that
only required tracking as many transaction statuses as in the first
xl_running_xacts. Am I missing something?

Aah, now I understand, we talked about slightly different things; I
considered the running thing to be a first step towards tracking aborted
txes everywhere. I am not sure it's that complicated, it would be exactly
the same as the committed tracking, except we'd do it only before we
reach SNAPBUILD_CONSISTENT. It would definitely be a larger patch, I
agree, but I can give it a try.

If you think that adding the SNAPBUILD_BUILD_INITIAL_SNAPSHOT would be
less invasive/smaller patch I am okay with doing that for PG10. I think
we'll have to revisit tracking of aborted transactions in PG11 then
though because of the 'snapshot too large' issue when exporting, at
least I don't see any other way to fix that.

The probabilistic tests catch the issues here fairly quickly, btw, if
you run with synchronous_commit=on, while pgbench is running, because
the WAL flushes make this more likely. Run this query:

SELECT account_count, teller_count, account_sum - teller_sum s
FROM
(
SELECT count(*) account_count, SUM(abalance) account_sum
FROM pgbench_accounts
) a,
(
SELECT count(*) teller_count, SUM(tbalance) teller_sum
FROM pgbench_tellers
) t

which, for my scale, should always return:
┌─────────┬─────┬───┐
│ a │ t │ s │
├─────────┼─────┼───┤
│ 2000000 │ 200 │ 0 │
└─────────┴─────┴───┘
but with my POC patch occasionally returns things like:
┌─────────┬─────┬───────┐
│ a │ t │ s │
├─────────┼─────┼───────┤
│ 2000000 │ 212 │ 37358 │
└─────────┴─────┴───────┘

which obviously shouldn't be the case.

Very nice (the test, not the failures ;)) !

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#61Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#42)
Re: snapbuild woes

On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote:

On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend. I'll do another round tomorrow, and will see
how it looks. I think we might need some more tests for this to be
committable, so it might not become committable tomorrow. I hope we'll
have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite
was more work than I'd anticipated, so I'm slightly lagging behind. I
hope to have a patchset tomorrow, aiming to commit something
Monday/Tuesday.

This PostgreSQL 10 open item is past due for your status update. Kindly
send a status update within 24 hours, and include a date for your
subsequent status update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review & testing, this is easy to get wrong...

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.

Also, this open item has been alive for three weeks, well above guideline. I
understand it's a tricky bug, but I'm worried this isn't on track to end.
What is missing to make it end?

Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#62Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Noah Misch (#61)
Re: snapbuild woes

On 04/05/17 07:45, Noah Misch wrote:

On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote:

On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend. I'll do another round tomorrow, and will see
how it looks. I think we might need some more tests for this to be
committable, so it might not become committable tomorrow. I hope we'll
have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite
was more work than I'd anticipated, so I'm slightly lagging behind. I
hope to have a patchset tomorrow, aiming to commit something
Monday/Tuesday.

This PostgreSQL 10 open item is past due for your status update. Kindly
send a status update within 24 hours, and include a date for your
subsequent status update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review & testing, this is easy to get wrong...

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.

Also, this open item has been alive for three weeks, well above guideline. I
understand it's a tricky bug, but I'm worried this isn't on track to end.
What is missing to make it end?

It's actually five tricky bugs, and they are quite sensitive to rare timing/concurrency events.

The first two are fixed, and we can live with the 5th being done later (it's not a correctness fix, but a fix for very bad performance).

We haven't yet found a fix for the 4th one that we all agree is a good solution (everything proposed so far still had bugs).

I am not quite sure what happened to the 3rd one.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#60)
Re: snapbuild woes

Hi,

On 2017-05-02 08:55:53 +0200, Petr Jelinek wrote:

Aah, now I understand we talked about slightly different things, I
considered the running thing to be first step towards tracking aborted
txes everywhere.
I think
we'll have to revisit tracking of aborted transactions in PG11 then
though because of the 'snapshot too large' issue when exporting, at
least I don't see any other way to fix that.

FWIW, that seems unnecessary - we can just check for that using the
clog. Should be very simple to check for aborted xacts when exporting
the snapshot (like 2 lines + comments). That should address your
concern, right?
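
Roughly what I have in mind, as a minimal sketch (illustrative only - the helper name is invented here, and the check would live wherever the exported snapshot's xip array gets filled):

#include "access/transam.h"		/* TransactionIdDidAbort() */

/*
 * Sketch only: skip xids that the clog already knows to have aborted
 * when building the snapshot we export, so they don't bloat its xip
 * array.  An aborted xact can never have made changes visible to
 * anyone, so leaving it out is safe.
 */
static bool
ExportSnapshotNeedsXid(TransactionId xid)
{
	return !TransactionIdDidAbort(xid);
}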

If you think that adding the SNAPBUILD_BUILD_INITIAL_SNAPSHOT would be
less invasive/smaller patch I am okay with doing that for PG10.

Attached is a prototype patch for that.

What I decided is that tracking the individual running xacts is too unreliable due to the race, so relying just on oldestRunningXid and nextXid - which come solely from the procArray and are thus race-free - is better.
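
Condensed to a sketch, the state machine now advances like this (taken from SnapBuildFindSnapshot() in the attached patch; the empty-running-xacts shortcut, serialization and cleanup paths are left out):

/*
 * Remember nextXid when entering a state, and only advance once a later
 * xl_running_xacts record's oldestRunningXid proves that every xact from
 * before that point has ended.  The record's per-xid list is deliberately
 * not trusted, since listed xacts may already have committed by the time
 * the record makes it into WAL.
 */
if (builder->state == SNAPBUILD_START)
{
	builder->started_collection_at = running->nextXid;
	builder->state = SNAPBUILD_BUILDING_SNAPSHOT;
}
else if (builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
		 TransactionIdPrecedesOrEquals(builder->started_collection_at,
									   running->oldestRunningXid))
{
	builder->started_collection_at = running->nextXid;
	builder->state = SNAPBUILD_FULL_SNAPSHOT;
}
else if (builder->state == SNAPBUILD_FULL_SNAPSHOT &&
		 TransactionIdPrecedesOrEquals(builder->started_collection_at,
									   running->oldestRunningXid))
{
	builder->state = SNAPBUILD_CONSISTENT;
}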

It's not perfect yet, primarily because we'd need to take a bit more
care about being ABI compatible for older releases, and because we'd
probably have to trigger LogStandbySnapshot() a bit more frequently
(presumably while waiting for WAL). The change means we'll have to wait
a bit longer for slot creation, but it's considerably simpler / more
robust.

Could you have a look?

Regards,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#63)
2 attachment(s)
Re: snapbuild woes

On 2017-05-04 17:00:04 -0700, Andres Freund wrote:

Attached is a prototype patch for that.

Oops.

Andres

Attachments:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patchtext/x-patch; charset=us-asciiDownload
From b6eb46e376e40f3e2e9a55d16b1b37b27904564b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 May 2017 16:40:52 -0700
Subject: [PATCH 1/2] WIP: Fix off-by-one around GetLastImportantRecPtr.

---
 src/backend/postmaster/bgwriter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index dcb4cf249c..d409d977c0 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -325,10 +325,11 @@ BackgroundWriterMain(void)
 
 			/*
 			 * Only log if enough time has passed and interesting records have
-			 * been inserted since the last snapshot.
+			 * been inserted since the last snapshot (it's <= because
+			 * last_snapshot_lsn points at the end+1 of the record).
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn < GetLastImportantRecPtr())
+				last_snapshot_lsn <= GetLastImportantRecPtr())
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
-- 
2.12.0.264.gd6db3f2165.dirty

0002-WIP-Possibly-more-robust-snapbuild-approach.patchtext/x-patch; charset=us-asciiDownload
From 7ed2aeb832029f5602566a665b3f4dbe8baedfcd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 May 2017 16:48:00 -0700
Subject: [PATCH 2/2] WIP: Possibly more robust snapbuild approach.

---
 contrib/test_decoding/expected/ondisk_startup.out |  15 +-
 contrib/test_decoding/specs/ondisk_startup.spec   |   8 +-
 src/backend/replication/logical/decode.c          |   3 -
 src/backend/replication/logical/snapbuild.c       | 386 +++++++++++-----------
 src/include/replication/snapbuild.h               |  25 +-
 5 files changed, 215 insertions(+), 222 deletions(-)

diff --git a/contrib/test_decoding/expected/ondisk_startup.out b/contrib/test_decoding/expected/ondisk_startup.out
index 65115c830a..c7b1f45b46 100644
--- a/contrib/test_decoding/expected/ondisk_startup.out
+++ b/contrib/test_decoding/expected/ondisk_startup.out
@@ -1,21 +1,30 @@
 Parsed test spec with 3 sessions
 
-starting permutation: s2txid s1init s3txid s2alter s2c s1insert s1checkpoint s1start s1insert s1alter s1insert s1start
-step s2txid: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL;
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s2b s2txid s3c s2c s1insert s1checkpoint s1start s1insert s1alter s1insert s1start
+step s2b: BEGIN;
+step s2txid: SELECT txid_current() IS NULL;
 ?column?       
 
 f              
 step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
-step s3txid: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL;
+step s3b: BEGIN;
+step s3txid: SELECT txid_current() IS NULL;
 ?column?       
 
 f              
 step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
 step s2c: COMMIT;
+step s2b: BEGIN;
+step s2txid: SELECT txid_current() IS NULL;
+?column?       
+
+f              
+step s3c: COMMIT;
 step s1init: <... completed>
 ?column?       
 
 init           
+step s2c: COMMIT;
 step s1insert: INSERT INTO do_write DEFAULT VALUES;
 step s1checkpoint: CHECKPOINT;
 step s1start: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false');
diff --git a/contrib/test_decoding/specs/ondisk_startup.spec b/contrib/test_decoding/specs/ondisk_startup.spec
index 8223705639..12c57a813d 100644
--- a/contrib/test_decoding/specs/ondisk_startup.spec
+++ b/contrib/test_decoding/specs/ondisk_startup.spec
@@ -24,7 +24,8 @@ step "s1alter" { ALTER TABLE do_write ADD COLUMN addedbys1 int; }
 session "s2"
 setup { SET synchronous_commit=on; }
 
-step "s2txid" { BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL; }
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT txid_current() IS NULL; }
 step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
 step "s2c" { COMMIT; }
 
@@ -32,7 +33,8 @@ step "s2c" { COMMIT; }
 session "s3"
 setup { SET synchronous_commit=on; }
 
-step "s3txid" { BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL; }
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT txid_current() IS NULL; }
 step "s3c" { COMMIT; }
 
 # Force usage of ondisk snapshot by starting and not finishing a
@@ -40,4 +42,4 @@ step "s3c" { COMMIT; }
 # reached. In combination with a checkpoint forcing a snapshot to be
 # written and a new restart point computed that'll lead to the usage
 # of the snapshot.
-permutation "s2txid" "s1init" "s3txid" "s2alter" "s2c" "s1insert" "s1checkpoint" "s1start" "s1insert" "s1alter" "s1insert" "s1start"
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s2b" "s2txid" "s3c" "s2c" "s1insert" "s1checkpoint" "s1start" "s1insert" "s1alter" "s1insert" "s1start"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26099..68825ef598 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -622,9 +622,6 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 {
 	int			i;
 
-	SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
-					  parsed->nsubxacts, parsed->subxacts);
-
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 068d214fa1..1176d2059b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -56,23 +56,34 @@
  *
  *
  * The snapbuild machinery is starting up in several stages, as illustrated
- * by the following graph:
+ * by the following graph describing the SnapBuild->state transitions:
+ *
  *		   +-------------------------+
- *	  +----|SNAPBUILD_START			 |-------------+
+ *	  +----|         START			 |-------------+
  *	  |    +-------------------------+			   |
  *	  |					|						   |
  *	  |					|						   |
- *	  |		running_xacts with running xacts	   |
+ *	  |		   running_xacts #1					   |
  *	  |					|						   |
  *	  |					|						   |
  *	  |					v						   |
  *	  |    +-------------------------+			   v
- *	  |    |SNAPBUILD_FULL_SNAPSHOT  |------------>|
+ *	  |    |   BUILDING_SNAPSHOT     |------------>|
  *	  |    +-------------------------+			   |
+ *	  |					|						   |
+ *	  |					|						   |
+ *	  |	running_xacts #2, xacts from #1 finished   |
+ *	  |					|						   |
+ *	  |					|						   |
+ *	  |					v						   |
+ *	  |    +-------------------------+			   v
+ *	  |    |       FULL_SNAPSHOT     |------------>|
+ *	  |    +-------------------------+			   |
+ *	  |					|						   |
  * running_xacts		|					   saved snapshot
  * with zero xacts		|				  at running_xacts's lsn
  *	  |					|						   |
- *	  |		all running toplevel TXNs finished	   |
+ *	  |	running_xacts with xacts from #2 finished  |
  *	  |					|						   |
  *	  |					v						   |
  *	  |    +-------------------------+			   |
@@ -83,7 +94,7 @@
  * record is read that is sufficiently new (above the safe xmin horizon),
  * there's a state transition. If there were no running xacts when the
  * running_xacts record was generated, we'll directly go into CONSISTENT
- * state, otherwise we'll switch to the FULL_SNAPSHOT state. Having a full
+ * state, otherwise we'll switch to the BUILDING_SNAPSHOT state. Having a full
  * snapshot means that all transactions that start henceforth can be decoded
  * in their entirety, but transactions that started previously can't. In
  * FULL_SNAPSHOT we'll switch into CONSISTENT once all those previously
@@ -184,6 +195,14 @@ struct SnapBuild
 	ReorderBuffer *reorder;
 
 	/*
+	 * When can the next state be reached?
+	 *
+	 * FIXME: More accurate name, possibly split into two?
+	 * FIXME: need to be moved into ->running.xmin or such for ABI compat.
+	 */
+	TransactionId started_collection_at;
+
+	/*
 	 * Information about initially running transactions
 	 *
 	 * When we start building a snapshot there already may be transactions in
@@ -203,7 +222,7 @@ struct SnapBuild
 		size_t		xcnt;		/* number of used xip entries */
 		size_t		xcnt_space; /* allocated size of xip */
 		TransactionId *xip;		/* running xacts array, xidComparator-sorted */
-	}			running;
+	}			running_old;
 
 	/*
 	 * Array of transactions which could have catalog changes that committed
@@ -249,12 +268,6 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* transaction state manipulation functions */
-static void SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
-
-/* ->running manipulation */
-static bool SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid);
-
 /* ->committed manipulation */
 static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
 
@@ -269,6 +282,7 @@ static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr
 
 /* xlog reading helper functions for SnapBuildProcessRecord */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
+static void SnapBuildWaitSnapshot(xl_running_xacts *running);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -700,7 +714,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
 	 * we got into the SNAPBUILD_FULL_SNAPSHOT state.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT &&
-		SnapBuildTxnIsRunning(builder, xid))
+		TransactionIdPrecedes(xid, builder->started_collection_at))
 		return false;
 
 	/*
@@ -769,38 +783,6 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Check whether `xid` is currently 'running'.
- *
- * Running transactions in our parlance are transactions which we didn't
- * observe from the start so we can't properly decode their contents. They
- * only exist after we freshly started from an < CONSISTENT snapshot.
- */
-static bool
-SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid)
-{
-	Assert(builder->state < SNAPBUILD_CONSISTENT);
-	Assert(TransactionIdIsNormal(builder->running.xmin));
-	Assert(TransactionIdIsNormal(builder->running.xmax));
-
-	if (builder->running.xcnt &&
-		NormalTransactionIdFollows(xid, builder->running.xmin) &&
-		NormalTransactionIdPrecedes(xid, builder->running.xmax))
-	{
-		TransactionId *search =
-		bsearch(&xid, builder->running.xip, builder->running.xcnt_space,
-				sizeof(TransactionId), xidComparator);
-
-		if (search != NULL)
-		{
-			Assert(*search == xid);
-			return true;
-		}
-	}
-
-	return false;
-}
-
-/*
  * Add a new Snapshot to all transactions we're decoding that currently are
  * in-progress so they can see new catalog contents made by the transaction
  * that just committed. This is necessary because those in-progress
@@ -922,63 +904,6 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 }
 
 /*
- * Common logic for SnapBuildAbortTxn and SnapBuildCommitTxn dealing with
- * keeping track of the amount of running transactions.
- */
-static void
-SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
-{
-	if (builder->state == SNAPBUILD_CONSISTENT)
-		return;
-
-	/*
-	 * NB: This handles subtransactions correctly even if we started from
-	 * suboverflowed xl_running_xacts because we only keep track of toplevel
-	 * transactions. Since the latter are always allocated before their
-	 * subxids and since they end at the same time it's sufficient to deal
-	 * with them here.
-	 */
-	if (SnapBuildTxnIsRunning(builder, xid))
-	{
-		Assert(builder->running.xcnt > 0);
-
-		if (!--builder->running.xcnt)
-		{
-			/*
-			 * None of the originally running transaction is running anymore,
-			 * so our incrementally built snapshot now is consistent.
-			 */
-			ereport(LOG,
-				  (errmsg("logical decoding found consistent point at %X/%X",
-						  (uint32) (lsn >> 32), (uint32) lsn),
-				   errdetail("Transaction ID %u finished; no more running transactions.",
-							 xid)));
-			builder->state = SNAPBUILD_CONSISTENT;
-		}
-	}
-}
-
-/*
- * Abort a transaction, throw away all state we kept.
- */
-void
-SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
-				  TransactionId xid,
-				  int nsubxacts, TransactionId *subxacts)
-{
-	int			i;
-
-	for (i = 0; i < nsubxacts; i++)
-	{
-		TransactionId subxid = subxacts[i];
-
-		SnapBuildEndTxn(builder, lsn, subxid);
-	}
-
-	SnapBuildEndTxn(builder, lsn, xid);
-}
-
-/*
  * Handle everything that needs to be done when a transaction commits
  */
 void
@@ -1022,11 +947,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		TransactionId subxid = subxacts[nxact];
 
 		/*
-		 * make sure txn is not tracked in running txn's anymore, switch state
-		 */
-		SnapBuildEndTxn(builder, lsn, subxid);
-
-		/*
 		 * If we're forcing timetravel we also need visibility information
 		 * about subtransaction, so keep track of subtransaction's state.
 		 */
@@ -1055,12 +975,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		}
 	}
 
-	/*
-	 * Make sure toplevel txn is not tracked in running txn's anymore, switch
-	 * state to consistent if possible.
-	 */
-	SnapBuildEndTxn(builder, lsn, xid);
-
 	if (forced_timetravel)
 	{
 		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
@@ -1250,9 +1164,45 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) or c).
+	 *	  state while waiting for c) or d), e).
 	 *
-	 * b) Wait for all toplevel transactions that were running to end. We
+	 * b) This (in a previous run) or another decoding slot serialized a
+	 *	  snapshot to disk that we can use.  Can't use this method for the
+	 *	  initial snapshot when slot is being created and needs full snapshot
+	 *	  for export or direct use, as that snapshot will only contain catalog
+	 *	  modifying transactions.
+	 *
+	 * c) First incrementally build a snapshot for catalog tuples
+	 *    (BUILDING_SNAPSHOT), that requires all, already in-progress,
+	 *    transactions to finish.  Every transaction starting after that
+	 *    (FULL_SNAPSHOT state), has enough information to be decoded.  But
+	 *    for older running transactions no viable snapshot exists yet, so
+	 *    CONSISTENT will only be reached once all of those have finished.
+	 *
+	 * c) In BUILDING_SNAPSHOT state (see d) ), and this xl_running_xacts'
+	 *    oldestRunningXid is >= than nextXid from when we switched to
+	 *    BUILDING_SNAPSHOT.  Switch to FULL_SNAPSHOT.
+	 *
+	 * d) In FULL_SNAPSHOT state (see d) ), and this xl_running_xacts'
+	 *    oldestRunningXid is >= than nextXid from when we switched to
+	 *    FULL_SNAPSHOT.   Switch to CONSISTENT.
+	 *
+	 * e) In START state, and a xl_running_xacts record with running xacts is
+	 *    encountered.  In that case, switch to BUILDING_SNAPSHOT state, and
+	 *    record xl_running_xacts->nextXid.  Once all running xacts have
+	 *    finished (i.e. they're all >= nextXid), we have a complete snapshot.
+	 *    It might look that we could use xl_running_xact's ->xids information
+	 *    to get there quicker, but that is problematic because transactions
+	 *    marked as running, might already have inserted their commit record -
+	 *    it's infeasible to change that with locking.
+
+	 *
+	 * d) In BUILDING_SNAPSHOT state (see c) ), and this xl_running_xacts'
+	 *    oldestRunningXid is newer than the
+	 *
+
+
+ Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
@@ -1264,11 +1214,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
 	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.  Can't use this method for the
-	 *	  initial snapshot when slot is being created and needs full snapshot
-	 *	  for export or direct use, as that snapshot will only contain catalog
-	 *	  modifying transactions.
+	 *    Unfortunately there's a race condition around LogStandbySnapshot(),
+	 *    where transactions might have logged their commit record, before
+	 *    xl_running_xacts itself is logged. In that case the decoding logic
+	 *    would have missed that fact.  Thus
+	 *
+	 * d) xl_running_xacts shows us that transaction(s) assumed to be still
+	 *    running have actually already finished.  Adjust their status.
 	 * ---
 	 */
 
@@ -1291,10 +1243,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	/*
 	 * a) No transaction were running, we can jump to consistent.
 	 *
+	 * This is not affected by races, because we can miss transaction commits,
+	 * but we can't miss transactions starting (XXX: Not true if we relax locking!).
+	 *
 	 * NB: We might have already started to incrementally assemble a snapshot,
 	 * so we need to be careful to deal with that.
 	 */
-	if (running->xcnt == 0)
+	if (running->oldestRunningXid == running->nextXid)
 	{
 		if (builder->start_decoding_at == InvalidXLogRecPtr ||
 			builder->start_decoding_at <= lsn)
@@ -1310,9 +1265,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		Assert(TransactionIdIsNormal(builder->xmax));
 
 		/* no transactions running now */
-		builder->running.xcnt = 0;
-		builder->running.xmin = InvalidTransactionId;
-		builder->running.xmax = InvalidTransactionId;
+		builder->running_old.xcnt = 0;
+		builder->running_old.xmin = InvalidTransactionId;
+		builder->running_old.xmax = InvalidTransactionId;
 
 		builder->state = SNAPBUILD_CONSISTENT;
 
@@ -1323,30 +1278,29 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state and not building full snapshot */
+	/* b) valid on disk state and not building full snapshot */
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
 		/* there won't be any state to cleanup */
 		return false;
 	}
-
 	/*
-	 * b) first encounter of a useable xl_running_xacts record. If we had
-	 * found one earlier we would either track running transactions (i.e.
-	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * c) transition from START to BUILDING_SNAPSHOT.
+	 *
+	 * In START state, and a xl_running_xacts record with running xacts is
+	 * encountered.  In that case, switch to BUILDING_SNAPSHOT state, and
+	 * record xl_running_xacts->nextXid.  Once all running xacts have finished
+	 * (i.e. they're all >= nextXid), we have a complete catalog snapshot.  It
+	 * might look that we could use xl_running_xact's ->xids information to
+	 * get there quicker, but that is problematic because transactions marked
+	 * as running, might already have inserted their commit record - it's
+	 * infeasible to change that with locking.
 	 */
-	else if (!builder->running.xcnt)
+	else if (builder->state == SNAPBUILD_START)
 	{
-		int			off;
-
-		/*
-		 * We only care about toplevel xids as those are the ones we
-		 * definitely see in the wal stream. As snapbuild.c tracks committed
-		 * instead of running transactions we don't need to know anything
-		 * about uncommitted subtransactions.
-		 */
+		builder->started_collection_at = running->nextXid;
+		builder->state = SNAPBUILD_BUILDING_SNAPSHOT;
 
 		/*
 		 * Start with an xmin/xmax that's correct for future, when all the
@@ -1360,59 +1314,59 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		Assert(TransactionIdIsNormal(builder->xmin));
 		Assert(TransactionIdIsNormal(builder->xmax));
 
-		builder->running.xcnt = running->xcnt;
-		builder->running.xcnt_space = running->xcnt;
-		builder->running.xip =
-			MemoryContextAlloc(builder->context,
-							   builder->running.xcnt * sizeof(TransactionId));
-		memcpy(builder->running.xip, running->xids,
-			   builder->running.xcnt * sizeof(TransactionId));
-
-		/* sort so we can do a binary search */
-		qsort(builder->running.xip, builder->running.xcnt,
-			  sizeof(TransactionId), xidComparator);
-
-		builder->running.xmin = builder->running.xip[0];
-		builder->running.xmax = builder->running.xip[running->xcnt - 1];
-
-		/* makes comparisons cheaper later */
-		TransactionIdRetreat(builder->running.xmin);
-		TransactionIdAdvance(builder->running.xmax);
-
-		builder->state = SNAPBUILD_FULL_SNAPSHOT;
-
 		ereport(LOG,
 			(errmsg("logical decoding found initial starting point at %X/%X",
 					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
+			 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
+					   running->xcnt, running->nextXid)));
 
-		/*
-		 * Iterate through all xids, wait for them to finish.
-		 *
-		 * This isn't required for the correctness of decoding, but to allow
-		 * isolationtester to notice that we're currently waiting for
-		 * something.
-		 */
-		for (off = 0; off < builder->running.xcnt; off++)
-		{
-			TransactionId xid = builder->running.xip[off];
+		SnapBuildWaitSnapshot(running);
+	}
+	/*
+	 * c) transition from BUILDING_SNAPSHOT to FULL_SNAPSHOT.
+	 *
+	 * In BUILDING_SNAPSHOT state, and this xl_running_xacts' oldestRunningXid
+	 * is >= than nextXid from when we switched to BUILDING_SNAPSHOT.  This
+	 * means all transactions starting afterwards have enough information to
+	 * be decoded.  Switch to FULL_SNAPSHOT.
+	 */
+	else if (builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
+			 TransactionIdPrecedesOrEquals(builder->started_collection_at,
+										   running->oldestRunningXid))
+	{
+		builder->state = SNAPBUILD_FULL_SNAPSHOT;
+		builder->started_collection_at = running->nextXid;
 
-			/*
-			 * Upper layers should prevent that we ever need to wait on
-			 * ourselves. Check anyway, since failing to do so would either
-			 * result in an endless wait or an Assert() failure.
-			 */
-			if (TransactionIdIsCurrentTransactionId(xid))
-				elog(ERROR, "waiting for ourselves");
+		SnapBuildWaitSnapshot(running);
 
-			XactLockTableWait(xid, NULL, NULL, XLTW_None);
-		}
-
-		/* nothing could have built up so far, so don't perform cleanup */
-		return false;
+		ereport(LOG,
+				(errmsg("logical decoding found initial consistent point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
+						   running->xcnt, running->nextXid)));
+	}
+	/*
+	 * c) transition from FULL_SNAPSHOT to CONSISTENT.
+	 *
+	 * In FULL_SNAPSHOT state (see d) ), and this xl_running_xacts'
+	 * oldestRunningXid is >= than nextXid from when we switched to
+	 * FULL_SNAPSHOT.  This means all transactions that are currently in
+	 * progress have a catalog snapshot, and all their changes have been
+	 * collected.  Switch to CONSISTENT.
+	 */
+	else if (builder->state == SNAPBUILD_FULL_SNAPSHOT &&
+			 TransactionIdPrecedesOrEquals(builder->started_collection_at,
+										   running->oldestRunningXid))
+	{
+		builder->state = SNAPBUILD_CONSISTENT;
+		ereport(LOG,
+				(errmsg("logical decoding found consistent point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail("There are no old transactions anymore.")));
+	}
+	else
+	{
+		SnapBuildWaitSnapshot(running);
 	}
 
 	/*
@@ -1421,8 +1375,35 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * records so incremental cleanup can be performed.
 	 */
 	return true;
+
 }
 
+/*
+ * Iterate through all xids in record, wait for them to finish.
+ *
+ * This isn't required for the correctness of decoding, but to allow
+ * isolationtester to notice that we're currently waiting for something.
+ */
+static void
+SnapBuildWaitSnapshot(xl_running_xacts *running)
+{
+	int			off;
+
+	for (off = 0; off < running->xcnt; off++)
+	{
+		TransactionId xid = running->xids[off];
+
+		/*
+		 * Upper layers should prevent that we ever need to wait on
+		 * ourselves. Check anyway, since failing to do so would either
+		 * result in an endless wait or an Assert() failure.
+		 */
+		if (TransactionIdIsCurrentTransactionId(xid))
+			elog(ERROR, "waiting for ourselves");
+
+		XactLockTableWait(xid, NULL, NULL, XLTW_None);
+	}
+}
 
 /* -----------------------------------
  * Snapshot serialization support
@@ -1572,7 +1553,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				 errmsg("could not remove file \"%s\": %m", path)));
 
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->running.xcnt_space +
+		sizeof(TransactionId) * builder->running_old.xcnt_space +
 		sizeof(TransactionId) * builder->committed.xcnt;
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
@@ -1591,7 +1572,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.context = NULL;
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
-	ondisk->builder.running.xip = NULL;
+	ondisk->builder.running_old.xip = NULL;
 	ondisk->builder.committed.xip = NULL;
 
 	COMP_CRC32C(ondisk->checksum,
@@ -1599,8 +1580,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				sizeof(SnapBuild));
 
 	/* copy running xacts */
-	sz = sizeof(TransactionId) * builder->running.xcnt_space;
-	memcpy(ondisk_c, builder->running.xip, sz);
+	sz = sizeof(TransactionId) * builder->running_old.xcnt_space;
+	memcpy(ondisk_c, builder->running_old.xip, sz);
 	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 	ondisk_c += sz;
 
@@ -1763,10 +1744,10 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore running xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.running.xcnt_space;
-	ondisk.builder.running.xip = MemoryContextAllocZero(builder->context, sz);
+	sz = sizeof(TransactionId) * ondisk.builder.running_old.xcnt_space;
+	ondisk.builder.running_old.xip = MemoryContextAllocZero(builder->context, sz);
 	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.running.xip, sz);
+	readBytes = read(fd, ondisk.builder.running_old.xip, sz);
 	pgstat_report_wait_end();
 	if (readBytes != sz)
 	{
@@ -1776,7 +1757,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				 errmsg("could not read file \"%s\", read %d of %d: %m",
 						path, readBytes, (int) sz)));
 	}
-	COMP_CRC32C(checksum, ondisk.builder.running.xip, sz);
+	COMP_CRC32C(checksum, ondisk.builder.running_old.xip, sz);
 
 	/* restore committed xacts information */
 	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
@@ -1842,11 +1823,12 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
-	builder->running.xcnt = ondisk.builder.running.xcnt;
-	if (builder->running.xip)
-		pfree(builder->running.xip);
-	builder->running.xcnt_space = ondisk.builder.running.xcnt_space;
-	builder->running.xip = ondisk.builder.running.xip;
+	/* FIXME: remove */
+	builder->running_old.xcnt = ondisk.builder.running_old.xcnt;
+	if (builder->running_old.xip)
+		pfree(builder->running_old.xip);
+	builder->running_old.xcnt_space = ondisk.builder.running_old.xcnt_space;
+	builder->running_old.xip = ondisk.builder.running_old.xip;
 
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
@@ -1867,8 +1849,8 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	return true;
 
 snapshot_not_interesting:
-	if (ondisk.builder.running.xip != NULL)
-		pfree(ondisk.builder.running.xip);
+	if (ondisk.builder.running_old.xip != NULL)
+		pfree(ondisk.builder.running_old.xip);
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
 	return false;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 494751d70a..ccb5f831c4 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -20,24 +20,30 @@ typedef enum
 	/*
 	 * Initial state, we can't do much yet.
 	 */
-	SNAPBUILD_START,
+	SNAPBUILD_START = -1,
+
+	/*
+	 * Collecting committed transactions, to build the initial catalog
+	 * snapshot.
+	 */
+	SNAPBUILD_BUILDING_SNAPSHOT = 0,
 
 	/*
 	 * We have collected enough information to decode tuples in transactions
 	 * that started after this.
 	 *
 	 * Once we reached this we start to collect changes. We cannot apply them
-	 * yet because the might be based on transactions that were still running
-	 * when we reached them yet.
+	 * yet, because they might be based on transactions that were still running
+	 * when FULL_SNAPSHOT was reached.
 	 */
-	SNAPBUILD_FULL_SNAPSHOT,
+	SNAPBUILD_FULL_SNAPSHOT = 1,
 
 	/*
-	 * Found a point after hitting built_full_snapshot where all transactions
-	 * that were running at that point finished. Till we reach that we hold
-	 * off calling any commit callbacks.
+	 * Found a point after SNAPBUILD_FULL_SNAPSHOT where all transactions that
+	 * were running at that point finished. Till we reach that we hold off
+	 * calling any commit callbacks.
 	 */
-	SNAPBUILD_CONSISTENT
+	SNAPBUILD_CONSISTENT = 2
 } SnapBuildState;
 
 /* forward declare so we don't have to expose the struct to the public */
@@ -73,9 +79,6 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 				   TransactionId xid, int nsubxacts,
 				   TransactionId *subxacts);
-extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
-				  TransactionId xid, int nsubxacts,
-				  TransactionId *subxacts);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 					   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
2.12.0.264.gd6db3f2165.dirty

#65Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#63)
Re: snapbuild woes

On 05/05/17 02:00, Andres Freund wrote:

Hi,

On 2017-05-02 08:55:53 +0200, Petr Jelinek wrote:

Aah, now I understand we talked about slightly different things, I
considered the running thing to be first step towards tracking aborted
txes everywhere.
I think
we'll have to revisit tracking of aborted transactions in PG11 then
though because of the 'snapshot too large' issue when exporting, at
least I don't see any other way to fix that.

FWIW, that seems unnecessary - we can just check for that using the
clog. Should be very simple to check for aborted xacts when exporting
the snapshot (like 2 lines + comments). That should address your
concern, right?

Right - for us there is no practical difference between a running and an aborted transaction, so we don't mind if the abort happens in the future. And yes, having the export code do that check seems quite simple.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#64)
Re: snapbuild woes

On 05/05/17 02:42, Andres Freund wrote:

On 2017-05-04 17:00:04 -0700, Andres Freund wrote:

Attached is a prototype patch for that.

I am not sure I understand the ABI comment for started_collection_at. What's the ABI issue? The struct is private to the snapbuild.c module. Or do you want to store it in the on-disk snapshot as well?

As for a better name, what about something like oldest_full_xact?

Otherwise the logic seems right on first read; I will ponder it a bit more.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#66)
Re: snapbuild woes

On 2017-05-05 13:53:16 +0200, Petr Jelinek wrote:

On 05/05/17 02:42, Andres Freund wrote:

On 2017-05-04 17:00:04 -0700, Andres Freund wrote:

Attached is a prototype patch for that.

I am not sure I understand the ABI comment for started_collection_at.
What's the ABI issue? The struct is private to snapbuild.c module. Or
you want to store it in the ondisk snapshot as well?

It's stored on-disk already :(

Otherwise the logic seems to be right on first read, will ponder it a
bit more

Cool!

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#68Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#67)
Re: snapbuild woes

On 05/05/17 18:18, Andres Freund wrote:

On 2017-05-05 13:53:16 +0200, Petr Jelinek wrote:

On 05/05/17 02:42, Andres Freund wrote:

On 2017-05-04 17:00:04 -0700, Andres Freund wrote:

Attached is a prototype patch for that.

I am not sure I understand the ABI comment for started_collection_at.
What's the ABI issue? The struct is private to snapbuild.c module. Or
you want to store it in the ondisk snapshot as well?

It's stored on-disk already :(

Hmm, okay - then I guess we'll have to store it somewhere inside running; we shouldn't normally care about that, since we only load CONSISTENT snapshots, for which running doesn't matter anymore. Alternatively we could bump SNAPBUILD_VERSION, but that would make it impossible to downgrade within a minor version, which is bad (I guess we'll want to do that for master, but not for the back branches).
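
For instance, something along these lines might do (a sketch only - the accessor names are invented here, this is not code from any of the posted patches):

/*
 * Keep the struct/on-disk layout unchanged in the back branches by
 * parking the "next phase reachable at" xid in running.xmin, which
 * carries no meaning for the CONSISTENT snapshots we actually serialize
 * and restore.
 */
static inline TransactionId
SnapBuildNextPhaseAt(SnapBuild *builder)
{
	return builder->running.xmin;
}

static inline void
SnapBuildSetNextPhaseAt(SnapBuild *builder, TransactionId at)
{
	builder->running.xmin = at;
}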

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#69Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#61)
Re: snapbuild woes

On May 3, 2017 10:45:16 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote:

On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:

On 2017-04-17 21:16:57 -0700, Andres Freund wrote:

I've since the previous update reviewed Petr's patch, which he since has updated over the weekend. I'll do another round tomorrow, and will see how it looks. I think we might need some more tests for this to be committable, so it might not become committable tomorrow. I hope we'll have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite was more work than I'd anticipated, so I'm slightly lagging behind. I hope to have a patchset tomorrow, aiming to commit something Monday/Tuesday.

This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review & testing, this is easy to get wrong...

This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update.

Also, this open item has been alive for three weeks, well above guideline. I understand it's a tricky bug, but I'm worried this isn't on track to end. What is missing to make it end?

Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I plan to commit the next pending patch after the back branch releases are cut - it's an invasive fix and the issue doesn't cause corruption, "just" slow slot creation. So it seems better to wait for a few days, rather than hurry it into the release.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70Erik Rijkers
er@xs4all.nl
In reply to: Andres Freund (#63)
Re: snapbuild woes

On 2017-05-05 02:00, Andres Freund wrote:

Could you have a look?

Running tests with these three patches:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where "error" means: at least 1 of the 4 pgbench tables is not md5-equal). It seems that better, faster machines yield fewer errors.

Normally I see in pg_stat_replication (on master) one process in state
'streaming'.

pid | wal | replay_loc | diff | state | app | sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are two more processes in pg_stat_replication that remain in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; there are only the two processes that stay in 'startup'.

FWIW, below is the output of a successful and a failed run:

------------------ successful run:
creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2'
publication pub1 with (disabled);
alter subscription sub1 enable;
-- pgbench -c 90 -j 8 -T 900 -P 180 -n -- scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)
-- waiting 0s... (always)
port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state
6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 | 0 | t | streaming | derail2 | async

2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837
d1afac950 d4abcc72e master
6973 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837
d1afac950 d4abcc72e replica ok bee2312c7
2017.05.08 23:20:48

port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state
6972 | pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 95357856 | f | streaming | derail2 | async
------------------------------------

------------------ failure:
creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2'
publication pub1 with (disabled);
alter subscription sub1 enable;
port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state
6972 | pg_stat_replication | 11945 | 18/5E2913D0 | | | | catchup | derail2 | async

-- pgbench -c 90 -j 8 -T 900 -P 180 -n -- scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)
-- waiting 0s... (always)
port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state
6972 | pg_stat_replication | 11948 | 18/7469A038 | | | | startup | derail2 | async
6972 | pg_stat_replication | 12372 | 18/7469A038 | | | | startup | derail2 | async

------------------------------------

During my tests, I keep an eye on pg_stat_replication (refreshing every 2s), and when I see those two 'startup' lines in pg_stat_replication without any 'streaming' line, I know the test is going to fail. I believe this has been true for all failure cases that I've seen (except the much rarer stuck-DROP-SUBSCRIPTION, which is mentioned in another thread).

Sorry, I have not been able to get anything more clear or definitive...

thanks,

Erik Rijkers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#70)
Re: snapbuild woes

On 09/05/17 00:03, Erik Rijkers wrote:

On 2017-05-05 02:00, Andres Freund wrote:

Could you have a look?

Running tests with these three patches:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where error means: at least
1 of the 4 pgbench tables is not md5-equal). It seems better, faster
machines yield less errors.

Normally I see in pg_stat_replication (on master) one process in state
'streaming'.

pid | wal | replay_loc | diff | state | app |
sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are another two processes in pg_stat_replication that remain
in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; in
failing sessions there are only the two processes that are and remain in
'startup'.

Hmm, startup is the state where slot creation is happening. I wonder if it's just taking a long time to create the snapshot because of the 5th issue, which is not yet fixed (and the original patch will not apply on top of this change). Alternatively, there is a bug in this patch.

Did you see high CPU usage during the test while those walsenders were in the "startup" state?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72Erik Rijkers
er@xs4all.nl
In reply to: Petr Jelinek (#71)
Re: snapbuild woes

On 2017-05-09 10:50, Petr Jelinek wrote:

On 09/05/17 00:03, Erik Rijkers wrote:

On 2017-05-05 02:00, Andres Freund wrote:

Could you have a look?

Running tests with these three patches:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where error means: at
least
1 of the 4 pgbench tables is not md5-equal). It seems better, faster
machines yield less errors.

Normally I see in pg_stat_replication (on master) one process in state
'streaming'.

pid | wal | replay_loc | diff | state | app |
sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 |
async

Often there are another two processes in pg_stat_replication that
remain
in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; in
failing sessions there are only the two processes that are and remain
in
'startup'.

Hmm, startup is the state where slot creation is happening. I wonder if
it's just taking long time to create snapshot because of the 5th issue
which is not yet fixed (and the original patch will not apply on top of
this change). Alternatively there is a bug in this patch.

Did you see high CPU usage during the test when there were those
"startup" state walsenders?

I haven't noticed, but I didn't pay particular attention to that.

I'll try to get some CPU info logged...

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#72)
1 attachment(s)
Re: snapbuild woes

On 09/05/17 10:59, Erik Rijkers wrote:

On 2017-05-09 10:50, Petr Jelinek wrote:

On 09/05/17 00:03, Erik Rijkers wrote:

On 2017-05-05 02:00, Andres Freund wrote:

Could you have a look?

Running tests with these three patches:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where error means: at least
1 of the 4 pgbench tables is not md5-equal). It seems better, faster
machines yield less errors.

Normally I see in pg_stat_replication (on master) one process in state
'streaming'.

pid | wal | replay_loc | diff | state | app |
sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 |
async

Often there are another two processes in pg_stat_replication that remain
in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; in
failing sessions there are only the two processes that are and remain in
'startup'.

Hmm, startup is the state where slot creation is happening. I wonder if
it's just taking long time to create snapshot because of the 5th issue
which is not yet fixed (and the original patch will not apply on top of
this change). Alternatively there is a bug in this patch.

Did you see high CPU usage during the test when there were those
"startup" state walsenders?

I haven't noticed but I didn't pay attention to that particularly.

I'll try to get some CPU-info logged...

I rebased the above-mentioned patch to apply on top of the patches Andres sent; if you could add it on top of what you have and check whether it still fails, that would be helpful.

Thanks!

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

Skip-unnecessary-snapshot-builds.patchbinary/octet-stream; name=Skip-unnecessary-snapshot-builds.patchDownload
From 1d1071aaacdb64228d195a1b5234be0f4716ce5c Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Tue, 9 May 2017 11:49:00 +0200
Subject: [PATCH] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
 src/backend/replication/logical/snapbuild.c | 100 ++++++++++++++++------------
 1 file changed, 58 insertions(+), 42 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1176d20..a6c0b66 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -912,9 +912,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 {
 	int			nxact;
 
-	bool		forced_timetravel = false;
-	bool		sub_needs_timetravel = false;
-	bool		top_needs_timetravel = false;
+	bool		need_timetravel = false;
+	bool		need_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -934,12 +933,22 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			builder->start_decoding_at = lsn + 1;
 
 		/*
-		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
-		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * When building full snapshot we need to keep track of all
+		 * transactions.
 		 */
-		forced_timetravel = true;
-		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+		if (builder->building_full_snapshot)
+		{
+			need_timetravel = true;
+			elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+		}
+
+		/*
+		 * If we could not observe the just finished transaction since it
+		 * started (because it started before we started tracking), we'll
+		 * always need a snapshot.
+		 */
+		if (TransactionIdPrecedes(xid, builder->started_collection_at))
+			need_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -947,23 +956,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		TransactionId subxid = subxacts[nxact];
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
-			sub_needs_timetravel = true;
+			need_timetravel = true;
+			need_snapshot = true;
 
 			elog(DEBUG1, "found subtransaction %u:%u with catalog changes.",
 				 xid, subxid);
@@ -973,31 +972,40 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we have already decided that timetravel is needed for this
+		 * transaction, we also need visibility information about
+		 * subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (need_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
-	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	/*
+	 * Add toplevel transaction to base snapshot if it made any cataog
+	 * changes...
+	 */
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
 
-		top_needs_timetravel = true;
+		need_timetravel = true;
+		need_snapshot = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
-	else if (sub_needs_timetravel)
+	/* ... or if previous checks decided we need timetravel anyway. */
+	else if (need_timetravel)
 	{
-		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
-	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
+	if (need_timetravel)
 	{
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
@@ -1018,15 +1026,25 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
 			return;
 
+		/* We always need to build snapshot if there isn't one yet. */
+		need_snapshot = need_snapshot || !builder->snapshot;
+
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
 		 */
-		if (builder->snapshot)
+		if (builder->snapshot && need_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (need_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1036,11 +1054,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (need_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

#74Erik Rijkers
er@xs4all.nl
In reply to: Petr Jelinek (#73)
Re: snapbuild woes

On 2017-05-09 11:50, Petr Jelinek wrote:

On 09/05/17 10:59, Erik Rijkers wrote:

On 2017-05-09 10:50, Petr Jelinek wrote:

On 09/05/17 00:03, Erik Rijkers wrote:

On 2017-05-05 02:00, Andres Freund wrote:

Could you have a look?

[...]

I rebased the above mentioned patch to apply to the patches Andres sent,
if you could try to add it on top of what you have and check if it still
fails, that would be helpful.

I suppose you mean these; but they do not apply anymore:

20170505/0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
20170505/0002-WIP-Possibly-more-robust-snapbuild-approach.patch

Andres, any chance you could update them?

alternatively I could use the older version again..

thanks,

Erik Rijkers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75Erik Rijkers
er@xs4all.nl
In reply to: Petr Jelinek (#73)
1 attachment(s)
Re: snapbuild woes

On 2017-05-09 11:50, Petr Jelinek wrote:

I rebased the above mentioned patch to apply to the patches Andres sent,
if you could try to add it on top of what you have and check if it still
fails, that would be helpful.

It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch +
- fix-statistics-reporting-in-logical-replication-work.patch +
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 ( so I had to add the
'fix-statistics-rep*' patch because without it I immediately got that
Assertion failure again ).

As always most runs succeed (especially on this large 192GB 16-core
server).

But attached is an output file of a number of runs of my
pgbench_derail2.sh test.

Overall result:

-- out_20170509_1635.txt
3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25
2 -- All is well.
1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I
did) - not interesting.

iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
Just after 'alter subscription sub1 enable' I caught (as is usual)
pg_stat_replication.state as 'catchup'. So far so good.
After the 15-minute pgbench run pg_stat_replication has only 2
'startup' lines (and none 'catchup' or 'streaming'):

port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state
6972 | pg_stat_replication | 108349 | 19/8FBCC248 | | | | startup | derail2 | async
6972 | pg_stat_replication | 108351 | 19/8FBCC248 | | | | startup | derail2 | async

(that's from:
select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
, pg_current_wal_location() wal, replay_location replay_loc
, pg_current_wal_location() - replay_location as diff
, pg_current_wal_location() <= replay_location
, state, application_name as app, sync_state
from pg_stat_replication
)

This remains in this state for as long as my test-program lets it
(i.e., 20 x 30s, or something like that, and then the loop is exited);
in the output file it says: 'Not good, but breaking out of wait'

Below is the accompanying ps (with the 2 'deranged senders' as Jeff
Janes would surely call them):

UID PID PPID C STIME TTY STAT TIME CMD
rijkers 107147 1 0 17:11 pts/35 S+ 0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers 107149 107147 0 17:11 ? Ss 0:00 \_ postgres: logger process
rijkers 107299 107147 0 17:11 ? Ss 0:01 \_ postgres: checkpointer process
rijkers 107300 107147 0 17:11 ? Ss 0:00 \_ postgres: writer process
rijkers 107301 107147 0 17:11 ? Ss 0:00 \_ postgres: wal writer process
rijkers 107302 107147 0 17:11 ? Ss 0:00 \_ postgres: autovacuum launcher process
rijkers 107303 107147 0 17:11 ? Ss 0:00 \_ postgres: stats collector process
rijkers 107304 107147 0 17:11 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher
rijkers 108348 107147 0 17:12 ? Ss 0:01 \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
rijkers 108350 107147 0 17:12 ? Ss 0:00 \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
rijkers 107145 1 0 17:11 pts/35 S+ 0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers 107151 107145 0 17:11 ? Ss 0:00 \_ postgres: logger process
rijkers 107160 107145 0 17:11 ? Ss 0:08 \_ postgres: checkpointer process
rijkers 107161 107145 0 17:11 ? Ss 0:07 \_ postgres: writer process
rijkers 107162 107145 0 17:11 ? Ss 0:02 \_ postgres: wal writer process
rijkers 107163 107145 0 17:11 ? Ss 0:00 \_ postgres: autovacuum launcher process
rijkers 107164 107145 0 17:11 ? Ss 0:02 \_ postgres: stats collector process
rijkers 107165 107145 0 17:11 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher
rijkers 108349 107145 0 17:12 ? Ss 0:27 \_ postgres: wal sender process rijkers [local] idle
rijkers 108351 107145 0 17:12 ? Ss 0:26 \_ postgres: wal sender process rijkers [local] idle

I have had no time to add (or view) any CPUinfo.

Erik Rijkers

Attachments:

out_20170509_1635.txtapplication/x-elc; name=out_20170509_1635.txtDownload
#76Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#75)
Re: snapbuild woes

On 09/05/17 19:54, Erik Rijkers wrote:

On 2017-05-09 11:50, Petr Jelinek wrote:

I rebased the above mentioned patch to apply to the patches Andres sent,
if you could try to add it on top of what you have and check if it still
fails, that would be helpful.

It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch +
- fix-statistics-reporting-in-logical-replication-work.patch +
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 ( so I had to add the
'fix-statistics-rep*' patch because without it I immediately got that
Assertion failure again ).

As always most runs succeed (especially on this large 192GB 16-core
server).

But attached is an output file of a number of runs of my
pgbench_derail2.sh test.

Overall result:

-- out_20170509_1635.txt
3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25
2 -- All is well.
1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I
did) - not interesting.

iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
Just after 'alter subscription sub1 enable' I caught (as is usual)
pg_stat_replication.state as 'catchup'. So far so good.
After the 15-minute pgbench run pg_stat_replication has only 2
'startup' lines (and none 'catchup' or 'streaming'):

port | pg_stat_replication | pid | wal | replay_loc | diff |
?column? | state | app | sync_state
6972 | pg_stat_replication | 108349 | 19/8FBCC248 | | |
| startup | derail2 | async
6972 | pg_stat_replication | 108351 | 19/8FBCC248 | | |
| startup | derail2 | async

(that's from:
select $port1 as port,'pg_stat_replication' as pg_stat_replication, pid
, pg_current_wal_location() wal, replay_location replay_loc,
pg_current_wal_location() - replay_location as diff
, pg_current_wal_location() <= replay_location
, state, application_name as app, sync_state
from pg_stat_replication
)

This remains in this state for as long as my test-program lets it
(i.e., 20 x 30s, or something like that, and then the loop is exited);
in the output file it says: 'Not good, but breaking out of wait'

Below is the accompanying ps (with the 2 'deranged senders' as Jeff
Janes would surely call them):

UID PID PPID C STIME TTY STAT TIME CMD
rijkers 107147 1 0 17:11 pts/35 S+ 0:00
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres
-D /var/data1/pg_stuff/pg_installations
rijkers 107149 107147 0 17:11 ? Ss 0:00 \_ postgres:
logger process
rijkers 107299 107147 0 17:11 ? Ss 0:01 \_ postgres:
checkpointer process
rijkers 107300 107147 0 17:11 ? Ss 0:00 \_ postgres:
writer process
rijkers 107301 107147 0 17:11 ? Ss 0:00 \_ postgres: wal
writer process
rijkers 107302 107147 0 17:11 ? Ss 0:00 \_ postgres:
autovacuum launcher process
rijkers 107303 107147 0 17:11 ? Ss 0:00 \_ postgres: stats
collector process
rijkers 107304 107147 0 17:11 ? Ss 0:00 \_ postgres:
bgworker: logical replication launcher
rijkers 108348 107147 0 17:12 ? Ss 0:01 \_ postgres:
bgworker: logical replication worker for subscription 70310 sync 70293
rijkers 108350 107147 0 17:12 ? Ss 0:00 \_ postgres:
bgworker: logical replication worker for subscription 70310 sync 70298
rijkers 107145 1 0 17:11 pts/35 S+ 0:02
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres
-D /var/data1/pg_stuff/pg_installations
rijkers 107151 107145 0 17:11 ? Ss 0:00 \_ postgres:
logger process
rijkers 107160 107145 0 17:11 ? Ss 0:08 \_ postgres:
checkpointer process
rijkers 107161 107145 0 17:11 ? Ss 0:07 \_ postgres:
writer process
rijkers 107162 107145 0 17:11 ? Ss 0:02 \_ postgres: wal
writer process
rijkers 107163 107145 0 17:11 ? Ss 0:00 \_ postgres:
autovacuum launcher process
rijkers 107164 107145 0 17:11 ? Ss 0:02 \_ postgres: stats
collector process
rijkers 107165 107145 0 17:11 ? Ss 0:00 \_ postgres:
bgworker: logical replication launcher
rijkers 108349 107145 0 17:12 ? Ss 0:27 \_ postgres: wal
sender process rijkers [local] idle
rijkers 108351 107145 0 17:12 ? Ss 0:26 \_ postgres: wal
sender process rijkers [local] idle

I have had no time to add (or view) any CPUinfo.

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1] /messages/by-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com
[2] /messages/by-id/CAMkU=1xUJKs=2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw@mail.gmail.com

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#77Erik Rijkers
er@xs4all.nl
In reply to: Petr Jelinek (#76)
Re: snapbuild woes

On 2017-05-09 21:00, Petr Jelinek wrote:

On 09/05/17 19:54, Erik Rijkers wrote:

On 2017-05-09 11:50, Petr Jelinek wrote:

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
/messages/by-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com
[2]
/messages/by-id/CAMkU=1xUJKs=2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw@mail.gmail.com

I don't understand why you come to that conclusion: both Masahiko Sawada
and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't.
Isn't that a real difference?

( I do sometimes get that DROP-SUBSCRIPTION too, but much less often
than the sync-failure. )

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#78Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Erik Rijkers (#77)
Re: snapbuild woes

On 09/05/17 22:11, Erik Rijkers wrote:

On 2017-05-09 21:00, Petr Jelinek wrote:

On 09/05/17 19:54, Erik Rijkers wrote:

On 2017-05-09 11:50, Petr Jelinek wrote:

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
/messages/by-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com

[2]
/messages/by-id/CAMkU=1xUJKs=2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw@mail.gmail.com

I don't understand why you come to that conclusion: both Masahiko Sawada
and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't.
Isn't that a real difference?

Because I only see the sync workers running in the output you showed. The
bug described in those threads can, as one of its side effects, cause the
sync workers to wait forever for an apply worker that was killed, crashed,
etc. at just the right moment, which I guess is what happened during your
failed test (there should be some info in the log about the apply worker exiting).
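
For instance, grepping the subscriber's server log along these lines should
show whether the apply worker started and then went away (the log path below
is just a placeholder, and the exact message wording differs between versions,
so treat this as a sketch only):

grep -iE 'logical replication.*worker|background worker' /path/to/subscriber/logfile | tail -n 20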

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Petr Jelinek (#78)
Re: snapbuild woes

On 10 May 2017, at 11:43, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

On 09/05/17 22:11, Erik Rijkers wrote:

On 2017-05-09 21:00, Petr Jelinek wrote:

On 09/05/17 19:54, Erik Rijkers wrote:

On 2017-05-09 11:50, Petr Jelinek wrote:

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
/messages/by-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com

[2]
/messages/by-id/CAMkU=1xUJKs=2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw@mail.gmail.com

I don't understand why you come to that conclusion: both Masahiko Sawada
and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't.
Isn't that a real difference?

Just noted another issue/feature with the snapshot builder: when logical decoding is in progress
and a VACUUM FULL command is issued, quite a large number of files appears in pg_logical/mappings
and they stay there until the next checkpoint. Also, running consecutive VACUUM FULLs before a
checkpoint yields even more files.

Having two pgbench-filled instances with logical replication between them:

for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done;
VACUUM
73
VACUUM
454
VACUUM
1146
VACUUM
2149
VACUUM
3463
VACUUM
5088
<...>
VACUUM
44708
<…> // checkpoint happens somewhere here
VACUUM
20921
WARNING: could not truncate file "base/16384/61773": Too many open files in system
ERROR: could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system

I’m not sure whether this boils down to some of the previous issues mentioned here or not, so I’m
posting it here as an observation.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Stas Kelvich (#79)
Re: snapbuild woes

On 11/05/17 16:33, Stas Kelvich wrote:

On 10 May 2017, at 11:43, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

On 09/05/17 22:11, Erik Rijkers wrote:

On 2017-05-09 21:00, Petr Jelinek wrote:

On 09/05/17 19:54, Erik Rijkers wrote:

On 2017-05-09 11:50, Petr Jelinek wrote:

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
/messages/by-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com

[2]
/messages/by-id/CAMkU=1xUJKs=2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw@mail.gmail.com

I don't understand why you come to that conclusion: both Masahiko Sawada
and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't.
Isn't that a real difference?

Just noted another issue/feature with the snapshot builder: when logical decoding is in progress
and a VACUUM FULL command is issued, quite a large number of files appears in pg_logical/mappings
and they stay there until the next checkpoint. Also, running consecutive VACUUM FULLs before a
checkpoint yields even more files.

Having two pgbench-filled instances with logical replication between them:

for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done;
VACUUM
73
VACUUM
454
VACUUM
1146
VACUUM
2149
VACUUM
3463
VACUUM
5088
<...>
VACUUM
44708
<…> // checkpoint happens somewhere here
VACUUM
20921
WARNING: could not truncate file "base/16384/61773": Too many open files in system
ERROR: could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system

I’m not sure whether this boils down to some of the previous issues mentioned here or not, so I’m
posting it here as an observation.

This has nothing to do with the snapshot builder, so this is not the thread
for it. See the comment section near the bottom of
src/backend/access/heap/rewriteheap.c for an explanation of why it does
what it does.
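
If you just want to see the cleanup happen, a minimal sketch (run from the
publisher's data directory, with a superuser psql on the same instance; the
counts are obviously illustrative) is:

ls pg_logical/mappings | wc -l
psql -c 'checkpoint'
ls pg_logical/mappings | wc -l

Mapping files that are still needed by a lagging replication slot will of
course survive the checkpoint.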

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#69)
Re: snapbuild woes

On 2017-05-08 00:10:12 -0700, Andres Freund wrote:

I plan to commit the next pending patch after the back branch releases
are cut - it's an invasive fix and the issue doesn't cause corruption
"just" slow slot creation. So it seems better to wait for a few days,
rather than hurry it into the release.

Now that that's done, here's an updated version of that patch. Note the
new logic to trigger xl_running_xact's to be logged at the right spot.
Works well in my testing.

I plan to commit this fairly soon, unless somebody wants a bit more time
to look into it.

Unless somebody protests, I'd like to slightly revise how the on-disk
snapshots are stored on master - given the issues this bug/commit showed
with the current method - but I can see one could argue that that should
rather be done next release.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82Peter Geoghegan
In reply to: Andres Freund (#81)
Re: snapbuild woes

On Thu, May 11, 2017 at 2:51 PM, Andres Freund <andres@anarazel.de> wrote:

Now that that's done, here's an updated version of that patch. Note the
new logic to trigger xl_running_xact's to be logged at the right spot.
Works well in my testing.

You forgot the patch. :-)

--
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#83Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#81)
1 attachment(s)
Re: snapbuild woes

On 2017-05-11 14:51:55 -0700, Andres Freund wrote:

On 2017-05-08 00:10:12 -0700, Andres Freund wrote:

I plan to commit the next pending patch after the back branch releases
are cut - it's an invasive fix and the issue doesn't cause corruption
"just" slow slot creation. So it seems better to wait for a few days,
rather than hurry it into the release.

Now that that's done, here's an updated version of that patch. Note the
new logic to trigger xl_running_xact's to be logged at the right spot.
Works well in my testing.

I plan to commit this fairly soon, unless somebody wants a bit more time
to look into it.

Unless somebody protests, I'd like to slightly revise how the on-disk
snapshots are stored on master - given the issues this bug/commit showed
with the current method - but I can see one could argue that that should
rather be done next release.

As usual I forgot my attachment.

- Andres

Attachments:

0001-Fix-race-condition-leading-to-hanging-logical-slot-c.patchtext/x-patch; charset=us-asciiDownload
From 3ecec368e3015f5082f120b1a428d1108203a356 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 May 2017 16:48:00 -0700
Subject: [PATCH] Fix race condition leading to hanging logical slot creation.

The snapshot assembly during the creation of logical slots relied on
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records.  Unfortunately, despite locking, it is
possible to see an xl_running_xacts record listing transactions as
running that have already WAL-logged a commit/abort record, as the
locking just prevents the ProcArray from being adjusted, and the commit
record has to be logged first.

That led to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions.  That hang resolved only if an xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.

It's impractical to prevent that via more heavyweight locking, the
likelihood of deadlocks and significantly increased contention would
be too big.

Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code.  That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
   we cannot use it to build the initial snapshot.  Instead we have to
   wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created as soon
   as all transactions perceived as running had ended according to their
   commit/abort records; now we have to wait for the next
   xl_running_xacts record.
To address that, trigger logging a new xl_running_xacts record from
within snapbuild.c exactly when necessary.

Unfortunately snapbuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release.  As this is going to be backpatched, we have to play
around a bit to keep on-disk compatibility.  A later commit will
rejigger that on master.

Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
---
 contrib/test_decoding/expected/ondisk_startup.out |  15 +-
 contrib/test_decoding/specs/ondisk_startup.spec   |   8 +-
 src/backend/replication/logical/decode.c          |   3 -
 src/backend/replication/logical/reorderbuffer.c   |   2 +-
 src/backend/replication/logical/snapbuild.c       | 416 ++++++++++------------
 src/include/replication/snapbuild.h               |  25 +-
 6 files changed, 220 insertions(+), 249 deletions(-)

diff --git a/contrib/test_decoding/expected/ondisk_startup.out b/contrib/test_decoding/expected/ondisk_startup.out
index 65115c830a..c7b1f45b46 100644
--- a/contrib/test_decoding/expected/ondisk_startup.out
+++ b/contrib/test_decoding/expected/ondisk_startup.out
@@ -1,21 +1,30 @@
 Parsed test spec with 3 sessions
 
-starting permutation: s2txid s1init s3txid s2alter s2c s1insert s1checkpoint s1start s1insert s1alter s1insert s1start
-step s2txid: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL;
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s2b s2txid s3c s2c s1insert s1checkpoint s1start s1insert s1alter s1insert s1start
+step s2b: BEGIN;
+step s2txid: SELECT txid_current() IS NULL;
 ?column?       
 
 f              
 step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
-step s3txid: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL;
+step s3b: BEGIN;
+step s3txid: SELECT txid_current() IS NULL;
 ?column?       
 
 f              
 step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
 step s2c: COMMIT;
+step s2b: BEGIN;
+step s2txid: SELECT txid_current() IS NULL;
+?column?       
+
+f              
+step s3c: COMMIT;
 step s1init: <... completed>
 ?column?       
 
 init           
+step s2c: COMMIT;
 step s1insert: INSERT INTO do_write DEFAULT VALUES;
 step s1checkpoint: CHECKPOINT;
 step s1start: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false');
diff --git a/contrib/test_decoding/specs/ondisk_startup.spec b/contrib/test_decoding/specs/ondisk_startup.spec
index 8223705639..12c57a813d 100644
--- a/contrib/test_decoding/specs/ondisk_startup.spec
+++ b/contrib/test_decoding/specs/ondisk_startup.spec
@@ -24,7 +24,8 @@ step "s1alter" { ALTER TABLE do_write ADD COLUMN addedbys1 int; }
 session "s2"
 setup { SET synchronous_commit=on; }
 
-step "s2txid" { BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL; }
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT txid_current() IS NULL; }
 step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
 step "s2c" { COMMIT; }
 
@@ -32,7 +33,8 @@ step "s2c" { COMMIT; }
 session "s3"
 setup { SET synchronous_commit=on; }
 
-step "s3txid" { BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT txid_current() IS NULL; }
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT txid_current() IS NULL; }
 step "s3c" { COMMIT; }
 
 # Force usage of ondisk snapshot by starting and not finishing a
@@ -40,4 +42,4 @@ step "s3c" { COMMIT; }
 # reached. In combination with a checkpoint forcing a snapshot to be
 # written and a new restart point computed that'll lead to the usage
 # of the snapshot.
-permutation "s2txid" "s1init" "s3txid" "s2alter" "s2c" "s1insert" "s1checkpoint" "s1start" "s1insert" "s1alter" "s1insert" "s1start"
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s2b" "s2txid" "s3c" "s2c" "s1insert" "s1checkpoint" "s1start" "s1insert" "s1alter" "s1insert" "s1start"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26099..68825ef598 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -622,9 +622,6 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 {
 	int			i;
 
-	SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
-					  parsed->nsubxacts, parsed->subxacts);
-
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0c174be8ed..9d882dac56 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1725,7 +1725,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			elog(DEBUG1, "aborting old transaction %u", txn->xid);
+			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
 			ReorderBufferCleanupTXN(rb, txn);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 068d214fa1..0f2dcb84be 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -56,23 +56,34 @@
  *
  *
  * The snapbuild machinery is starting up in several stages, as illustrated
- * by the following graph:
+ * by the following graph describing the SnapBuild->state transitions:
+ *
  *		   +-------------------------+
- *	  +----|SNAPBUILD_START			 |-------------+
+ *	  +----|         START			 |-------------+
  *	  |    +-------------------------+			   |
  *	  |					|						   |
  *	  |					|						   |
- *	  |		running_xacts with running xacts	   |
+ *	  |		   running_xacts #1					   |
  *	  |					|						   |
  *	  |					|						   |
  *	  |					v						   |
  *	  |    +-------------------------+			   v
- *	  |    |SNAPBUILD_FULL_SNAPSHOT  |------------>|
+ *	  |    |   BUILDING_SNAPSHOT     |------------>|
  *	  |    +-------------------------+			   |
+ *	  |					|						   |
+ *	  |					|						   |
+ *	  |	running_xacts #2, xacts from #1 finished   |
+ *	  |					|						   |
+ *	  |					|						   |
+ *	  |					v						   |
+ *	  |    +-------------------------+			   v
+ *	  |    |       FULL_SNAPSHOT     |------------>|
+ *	  |    +-------------------------+			   |
+ *	  |					|						   |
  * running_xacts		|					   saved snapshot
  * with zero xacts		|				  at running_xacts's lsn
  *	  |					|						   |
- *	  |		all running toplevel TXNs finished	   |
+ *	  |	running_xacts with xacts from #2 finished  |
  *	  |					|						   |
  *	  |					v						   |
  *	  |    +-------------------------+			   |
@@ -83,7 +94,7 @@
  * record is read that is sufficiently new (above the safe xmin horizon),
  * there's a state transition. If there were no running xacts when the
  * running_xacts record was generated, we'll directly go into CONSISTENT
- * state, otherwise we'll switch to the FULL_SNAPSHOT state. Having a full
+ * state, otherwise we'll switch to the BUILDING_SNAPSHOT state. Having a full
  * snapshot means that all transactions that start henceforth can be decoded
  * in their entirety, but transactions that started previously can't. In
  * FULL_SNAPSHOT we'll switch into CONSISTENT once all those previously
@@ -184,26 +195,24 @@ struct SnapBuild
 	ReorderBuffer *reorder;
 
 	/*
-	 * Information about initially running transactions
-	 *
-	 * When we start building a snapshot there already may be transactions in
-	 * progress.  Those are stored in running.xip.  We don't have enough
-	 * information about those to decode their contents, so until they are
-	 * finished (xcnt=0) we cannot switch to a CONSISTENT state.
+	 * Outdated: This struct isn't used for its original purpose anymore, but
+	 * can't be removed / changed in a minor version, because it's stored
+	 * on-disk.
 	 */
 	struct
 	{
 		/*
-		 * As long as running.xcnt all XIDs < running.xmin and > running.xmax
-		 * have to be checked whether they still are running.
+		 * NB: This field is misused, until a major version can break on-disk
+		 * compatibility. See SnapBuildNextPhaseAt() /
+		 * SnapBuildStartNextPhaseAt().
 		 */
-		TransactionId xmin;
-		TransactionId xmax;
+		TransactionId was_xmin;
+		TransactionId was_xmax;
 
-		size_t		xcnt;		/* number of used xip entries */
-		size_t		xcnt_space; /* allocated size of xip */
-		TransactionId *xip;		/* running xacts array, xidComparator-sorted */
-	}			running;
+		size_t		was_xcnt;		/* number of used xip entries */
+		size_t		was_xcnt_space; /* allocated size of xip */
+		TransactionId *was_xip;		/* running xacts array, xidComparator-sorted */
+	}			was_running;
 
 	/*
 	 * Array of transactions which could have catalog changes that committed
@@ -249,12 +258,6 @@ struct SnapBuild
 static ResourceOwner SavedResourceOwnerDuringExport = NULL;
 static bool ExportInProgress = false;
 
-/* transaction state manipulation functions */
-static void SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
-
-/* ->running manipulation */
-static bool SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid);
-
 /* ->committed manipulation */
 static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
 
@@ -269,11 +272,39 @@ static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr
 
 /* xlog reading helper functions for SnapBuildProcessRecord */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
+static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
 static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
 
+/*
+ * Return TransactionId after which the next phase of initial snapshot
+ * building will happen.
+ */
+static inline TransactionId
+SnapBuildNextPhaseAt(SnapBuild *builder)
+{
+	/*
+	 * For backward compatibility reasons this has to be stored in the wrongly
+	 * named field.  Will be fixed in next major version.
+	 */
+	return builder->was_running.was_xmax;
+}
+
+/*
+ * Set TransactionId after which the next phase of initial snapshot building
+ * will happen.
+ */
+static inline void
+SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
+{
+	/*
+	 * For backward compatibility reasons this has to be stored in the wrongly
+	 * named field.  Will be fixed in next major version.
+	 */
+	builder->was_running.was_xmax = at;
+}
 
 /*
  * Allocate a new snapshot builder.
@@ -700,7 +731,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
 	 * we got into the SNAPBUILD_FULL_SNAPSHOT state.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT &&
-		SnapBuildTxnIsRunning(builder, xid))
+		TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder)))
 		return false;
 
 	/*
@@ -769,38 +800,6 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Check whether `xid` is currently 'running'.
- *
- * Running transactions in our parlance are transactions which we didn't
- * observe from the start so we can't properly decode their contents. They
- * only exist after we freshly started from an < CONSISTENT snapshot.
- */
-static bool
-SnapBuildTxnIsRunning(SnapBuild *builder, TransactionId xid)
-{
-	Assert(builder->state < SNAPBUILD_CONSISTENT);
-	Assert(TransactionIdIsNormal(builder->running.xmin));
-	Assert(TransactionIdIsNormal(builder->running.xmax));
-
-	if (builder->running.xcnt &&
-		NormalTransactionIdFollows(xid, builder->running.xmin) &&
-		NormalTransactionIdPrecedes(xid, builder->running.xmax))
-	{
-		TransactionId *search =
-		bsearch(&xid, builder->running.xip, builder->running.xcnt_space,
-				sizeof(TransactionId), xidComparator);
-
-		if (search != NULL)
-		{
-			Assert(*search == xid);
-			return true;
-		}
-	}
-
-	return false;
-}
-
-/*
  * Add a new Snapshot to all transactions we're decoding that currently are
  * in-progress so they can see new catalog contents made by the transaction
  * that just committed. This is necessary because those in-progress
@@ -922,63 +921,6 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 }
 
 /*
- * Common logic for SnapBuildAbortTxn and SnapBuildCommitTxn dealing with
- * keeping track of the amount of running transactions.
- */
-static void
-SnapBuildEndTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
-{
-	if (builder->state == SNAPBUILD_CONSISTENT)
-		return;
-
-	/*
-	 * NB: This handles subtransactions correctly even if we started from
-	 * suboverflowed xl_running_xacts because we only keep track of toplevel
-	 * transactions. Since the latter are always allocated before their
-	 * subxids and since they end at the same time it's sufficient to deal
-	 * with them here.
-	 */
-	if (SnapBuildTxnIsRunning(builder, xid))
-	{
-		Assert(builder->running.xcnt > 0);
-
-		if (!--builder->running.xcnt)
-		{
-			/*
-			 * None of the originally running transaction is running anymore,
-			 * so our incrementally built snapshot now is consistent.
-			 */
-			ereport(LOG,
-				  (errmsg("logical decoding found consistent point at %X/%X",
-						  (uint32) (lsn >> 32), (uint32) lsn),
-				   errdetail("Transaction ID %u finished; no more running transactions.",
-							 xid)));
-			builder->state = SNAPBUILD_CONSISTENT;
-		}
-	}
-}
-
-/*
- * Abort a transaction, throw away all state we kept.
- */
-void
-SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
-				  TransactionId xid,
-				  int nsubxacts, TransactionId *subxacts)
-{
-	int			i;
-
-	for (i = 0; i < nsubxacts; i++)
-	{
-		TransactionId subxid = subxacts[i];
-
-		SnapBuildEndTxn(builder, lsn, subxid);
-	}
-
-	SnapBuildEndTxn(builder, lsn, xid);
-}
-
-/*
  * Handle everything that needs to be done when a transaction commits
  */
 void
@@ -1022,11 +964,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		TransactionId subxid = subxacts[nxact];
 
 		/*
-		 * make sure txn is not tracked in running txn's anymore, switch state
-		 */
-		SnapBuildEndTxn(builder, lsn, subxid);
-
-		/*
 		 * If we're forcing timetravel we also need visibility information
 		 * about subtransaction, so keep track of subtransaction's state.
 		 */
@@ -1055,12 +992,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		}
 	}
 
-	/*
-	 * Make sure toplevel txn is not tracked in running txn's anymore, switch
-	 * state to consistent if possible.
-	 */
-	SnapBuildEndTxn(builder, lsn, xid);
-
 	if (forced_timetravel)
 	{
 		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
@@ -1250,25 +1181,20 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) or c).
+	 *	  state while waiting on c)'s sub-states.
 	 *
-	 * b) Wait for all toplevel transactions that were running to end. We
-	 *	  simply track the number of in-progress toplevel transactions and
-	 *	  lower it whenever one commits or aborts. When that number
-	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
-	 *	  NB: We need to search running.xip when seeing a transaction's end to
-	 *	  make sure it's a toplevel transaction and it's been one of the
-	 *	  initially running ones.
-	 *	  Interestingly, in contrast to HS, this allows us not to care about
-	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
-	 *	  at all.
-	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
+	 * b) This (in a previous run) or another decoding slot serialized a
 	 *	  snapshot to disk that we can use.  Can't use this method for the
 	 *	  initial snapshot when slot is being created and needs full snapshot
 	 *	  for export or direct use, as that snapshot will only contain catalog
 	 *	  modifying transactions.
+	 *
+	 * c) First incrementally build a snapshot for catalog tuples
+	 *    (BUILDING_SNAPSHOT), that requires all, already in-progress,
+	 *    transactions to finish.  Every transaction starting after that
+	 *    (FULL_SNAPSHOT state), has enough information to be decoded.  But
+	 *    for older running transactions no viable snapshot exists yet, so
+	 *    CONSISTENT will only be reached once all of those have finished.
 	 * ---
 	 */
 
@@ -1285,16 +1211,23 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 								 (uint32) (lsn >> 32), (uint32) lsn),
 		errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 				 builder->initial_xmin_horizon, running->oldestRunningXid)));
+
+
+		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+
 		return true;
 	}
 
 	/*
 	 * a) No transaction were running, we can jump to consistent.
 	 *
+	 * This is not affected by races around xl_running_xacts, because we can
+	 * miss transaction commits, but currently not transactions starting.
+	 *
 	 * NB: We might have already started to incrementally assemble a snapshot,
 	 * so we need to be careful to deal with that.
 	 */
-	if (running->xcnt == 0)
+	if (running->oldestRunningXid == running->nextXid)
 	{
 		if (builder->start_decoding_at == InvalidXLogRecPtr ||
 			builder->start_decoding_at <= lsn)
@@ -1309,12 +1242,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		Assert(TransactionIdIsNormal(builder->xmin));
 		Assert(TransactionIdIsNormal(builder->xmax));
 
-		/* no transactions running now */
-		builder->running.xcnt = 0;
-		builder->running.xmin = InvalidTransactionId;
-		builder->running.xmax = InvalidTransactionId;
-
 		builder->state = SNAPBUILD_CONSISTENT;
+		SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
 
 		ereport(LOG,
 				(errmsg("logical decoding found consistent point at %X/%X",
@@ -1323,30 +1252,29 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state and not building full snapshot */
+	/* b) valid on disk state and not building full snapshot */
 	else if (!builder->building_full_snapshot &&
 			 SnapBuildRestore(builder, lsn))
 	{
 		/* there won't be any state to cleanup */
 		return false;
 	}
-
 	/*
-	 * b) first encounter of a useable xl_running_xacts record. If we had
-	 * found one earlier we would either track running transactions (i.e.
-	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * c) transition from START to BUILDING_SNAPSHOT.
+	 *
+	 * In START state, and a xl_running_xacts record with running xacts is
+	 * encountered.  In that case, switch to BUILDING_SNAPSHOT state, and
+	 * record xl_running_xacts->nextXid.  Once all running xacts have finished
+	 * (i.e. they're all >= nextXid), we have a complete catalog snapshot.  It
+	 * might look that we could use xl_running_xact's ->xids information to
+	 * get there quicker, but that is problematic because transactions marked
+	 * as running, might already have inserted their commit record - it's
+	 * infeasible to change that with locking.
 	 */
-	else if (!builder->running.xcnt)
+	else if (builder->state == SNAPBUILD_START)
 	{
-		int			off;
-
-		/*
-		 * We only care about toplevel xids as those are the ones we
-		 * definitely see in the wal stream. As snapbuild.c tracks committed
-		 * instead of running transactions we don't need to know anything
-		 * about uncommitted subtransactions.
-		 */
+		builder->state = SNAPBUILD_BUILDING_SNAPSHOT;
+		SnapBuildStartNextPhaseAt(builder, running->nextXid);
 
 		/*
 		 * Start with an xmin/xmax that's correct for future, when all the
@@ -1360,59 +1288,57 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		Assert(TransactionIdIsNormal(builder->xmin));
 		Assert(TransactionIdIsNormal(builder->xmax));
 
-		builder->running.xcnt = running->xcnt;
-		builder->running.xcnt_space = running->xcnt;
-		builder->running.xip =
-			MemoryContextAlloc(builder->context,
-							   builder->running.xcnt * sizeof(TransactionId));
-		memcpy(builder->running.xip, running->xids,
-			   builder->running.xcnt * sizeof(TransactionId));
-
-		/* sort so we can do a binary search */
-		qsort(builder->running.xip, builder->running.xcnt,
-			  sizeof(TransactionId), xidComparator);
-
-		builder->running.xmin = builder->running.xip[0];
-		builder->running.xmax = builder->running.xip[running->xcnt - 1];
-
-		/* makes comparisons cheaper later */
-		TransactionIdRetreat(builder->running.xmin);
-		TransactionIdAdvance(builder->running.xmax);
-
-		builder->state = SNAPBUILD_FULL_SNAPSHOT;
-
 		ereport(LOG,
 			(errmsg("logical decoding found initial starting point at %X/%X",
 					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
+			 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
+					   running->xcnt, running->nextXid)));
 
-		/*
-		 * Iterate through all xids, wait for them to finish.
-		 *
-		 * This isn't required for the correctness of decoding, but to allow
-		 * isolationtester to notice that we're currently waiting for
-		 * something.
-		 */
-		for (off = 0; off < builder->running.xcnt; off++)
-		{
-			TransactionId xid = builder->running.xip[off];
+		SnapBuildWaitSnapshot(running, running->nextXid);
+	}
+	/*
+	 * c) transition from BUILDING_SNAPSHOT to FULL_SNAPSHOT.
+	 *
+	 * In BUILDING_SNAPSHOT state, and this xl_running_xacts' oldestRunningXid
+	 * is >= than nextXid from when we switched to BUILDING_SNAPSHOT.  This
+	 * means all transactions starting afterwards have enough information to
+	 * be decoded.  Switch to FULL_SNAPSHOT.
+	 */
+	else if (builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
+			 TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder),
+										   running->oldestRunningXid))
+	{
+		builder->state = SNAPBUILD_FULL_SNAPSHOT;
+		SnapBuildStartNextPhaseAt(builder, running->nextXid);
 
-			/*
-			 * Upper layers should prevent that we ever need to wait on
-			 * ourselves. Check anyway, since failing to do so would either
-			 * result in an endless wait or an Assert() failure.
-			 */
-			if (TransactionIdIsCurrentTransactionId(xid))
-				elog(ERROR, "waiting for ourselves");
+		ereport(LOG,
+				(errmsg("logical decoding found initial consistent point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
+						   running->xcnt, running->nextXid)));
 
-			XactLockTableWait(xid, NULL, NULL, XLTW_None);
-		}
+		SnapBuildWaitSnapshot(running, running->nextXid);
+	}
+	/*
+	 * c) transition from FULL_SNAPSHOT to CONSISTENT.
+	 *
+	 * In FULL_SNAPSHOT state (see d) ), and this xl_running_xacts'
+	 * oldestRunningXid is >= than nextXid from when we switched to
+	 * FULL_SNAPSHOT.  This means all transactions that are currently in
+	 * progress have a catalog snapshot, and all their changes have been
+	 * collected.  Switch to CONSISTENT.
+	 */
+	else if (builder->state == SNAPBUILD_FULL_SNAPSHOT &&
+			 TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder),
+										   running->oldestRunningXid))
+	{
+		builder->state = SNAPBUILD_CONSISTENT;
+		SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
 
-		/* nothing could have built up so far, so don't perform cleanup */
-		return false;
+		ereport(LOG,
+				(errmsg("logical decoding found consistent point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail("There are no old transactions anymore.")));
 	}
 
 	/*
@@ -1421,8 +1347,54 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * records so incremental cleanup can be performed.
 	 */
 	return true;
+
 }
 
+/* ---
+ * Iterate through xids in record, wait for all older than the cutoff to
+ * finish.  Then, if possible, log a new xl_running_xacts record.
+ *
+ * This isn't required for the correctness of decoding, but to:
+ * a) allow isolationtester to notice that we're currently waiting for
+ *    something.
+ * b) log a new xl_running_xacts record where it'd be helpful, without having
+ *    to wait for bgwriter or checkpointer.
+ * ---
+ */
+static void
+SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+{
+	int			off;
+
+	for (off = 0; off < running->xcnt; off++)
+	{
+		TransactionId xid = running->xids[off];
+
+		/*
+		 * Upper layers should prevent that we ever need to wait on
+		 * ourselves. Check anyway, since failing to do so would either
+		 * result in an endless wait or an Assert() failure.
+		 */
+		if (TransactionIdIsCurrentTransactionId(xid))
+			elog(ERROR, "waiting for ourselves");
+
+		if (TransactionIdFollows(xid, cutoff))
+			continue;
+
+		XactLockTableWait(xid, NULL, NULL, XLTW_None);
+	}
+
+	/*
+	 * All transactions we needed to finish finished - try to ensure there is
+	 * another xl_running_xacts record in a timely manner, without having to
+	 * wait for bgwriter or checkpointer to log one.  During recovery we
+	 * can't enforce that, so we'll have to wait.
+	 */
+	if (!RecoveryInProgress())
+	{
+		LogStandbySnapshot();
+	}
+}
 
 /* -----------------------------------
  * Snapshot serialization support
@@ -1572,7 +1544,6 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 				 errmsg("could not remove file \"%s\": %m", path)));
 
 	needed_length = sizeof(SnapBuildOnDisk) +
-		sizeof(TransactionId) * builder->running.xcnt_space +
 		sizeof(TransactionId) * builder->committed.xcnt;
 
 	ondisk_c = MemoryContextAllocZero(builder->context, needed_length);
@@ -1591,18 +1562,14 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	ondisk->builder.context = NULL;
 	ondisk->builder.snapshot = NULL;
 	ondisk->builder.reorder = NULL;
-	ondisk->builder.running.xip = NULL;
 	ondisk->builder.committed.xip = NULL;
 
 	COMP_CRC32C(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
-	/* copy running xacts */
-	sz = sizeof(TransactionId) * builder->running.xcnt_space;
-	memcpy(ondisk_c, builder->running.xip, sz);
-	COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
-	ondisk_c += sz;
+	/* there shouldn't be any running xacts */
+	Assert(builder->was_running.was_xcnt == 0);
 
 	/* copy committed xacts */
 	sz = sizeof(TransactionId) * builder->committed.xcnt;
@@ -1762,11 +1729,12 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
 
-	/* restore running xacts information */
-	sz = sizeof(TransactionId) * ondisk.builder.running.xcnt_space;
-	ondisk.builder.running.xip = MemoryContextAllocZero(builder->context, sz);
+	/* restore running xacts (dead, but kept for backward compat) */
+	sz = sizeof(TransactionId) * ondisk.builder.was_running.was_xcnt_space;
+	ondisk.builder.was_running.was_xip =
+		MemoryContextAllocZero(builder->context, sz);
 	pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
-	readBytes = read(fd, ondisk.builder.running.xip, sz);
+	readBytes = read(fd, ondisk.builder.was_running.was_xip, sz);
 	pgstat_report_wait_end();
 	if (readBytes != sz)
 	{
@@ -1776,7 +1744,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 				 errmsg("could not read file \"%s\", read %d of %d: %m",
 						path, readBytes, (int) sz)));
 	}
-	COMP_CRC32C(checksum, ondisk.builder.running.xip, sz);
+	COMP_CRC32C(checksum, ondisk.builder.was_running.was_xip, sz);
 
 	/* restore committed xacts information */
 	sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
@@ -1842,12 +1810,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	ondisk.builder.committed.xip = NULL;
 
-	builder->running.xcnt = ondisk.builder.running.xcnt;
-	if (builder->running.xip)
-		pfree(builder->running.xip);
-	builder->running.xcnt_space = ondisk.builder.running.xcnt_space;
-	builder->running.xip = ondisk.builder.running.xip;
-
 	/* our snapshot is not interesting anymore, build a new one */
 	if (builder->snapshot != NULL)
 	{
@@ -1867,8 +1829,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	return true;
 
 snapshot_not_interesting:
-	if (ondisk.builder.running.xip != NULL)
-		pfree(ondisk.builder.running.xip);
 	if (ondisk.builder.committed.xip != NULL)
 		pfree(ondisk.builder.committed.xip);
 	return false;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 494751d70a..ccb5f831c4 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -20,24 +20,30 @@ typedef enum
 	/*
 	 * Initial state, we can't do much yet.
 	 */
-	SNAPBUILD_START,
+	SNAPBUILD_START = -1,
+
+	/*
+	 * Collecting committed transactions, to build the initial catalog
+	 * snapshot.
+	 */
+	SNAPBUILD_BUILDING_SNAPSHOT = 0,
 
 	/*
 	 * We have collected enough information to decode tuples in transactions
 	 * that started after this.
 	 *
 	 * Once we reached this we start to collect changes. We cannot apply them
-	 * yet because the might be based on transactions that were still running
-	 * when we reached them yet.
+	 * yet, because they might be based on transactions that were still running
+	 * when FULL_SNAPSHOT was reached.
 	 */
-	SNAPBUILD_FULL_SNAPSHOT,
+	SNAPBUILD_FULL_SNAPSHOT = 1,
 
 	/*
-	 * Found a point after hitting built_full_snapshot where all transactions
-	 * that were running at that point finished. Till we reach that we hold
-	 * off calling any commit callbacks.
+	 * Found a point after SNAPBUILD_FULL_SNAPSHOT where all transactions that
+	 * were running at that point finished. Till we reach that we hold off
+	 * calling any commit callbacks.
 	 */
-	SNAPBUILD_CONSISTENT
+	SNAPBUILD_CONSISTENT = 2
 } SnapBuildState;
 
 /* forward declare so we don't have to expose the struct to the public */
@@ -73,9 +79,6 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 				   TransactionId xid, int nsubxacts,
 				   TransactionId *subxacts);
-extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
-				  TransactionId xid, int nsubxacts,
-				  TransactionId *subxacts);
 extern bool SnapBuildProcessChange(SnapBuild *builder, TransactionId xid,
 					   XLogRecPtr lsn);
 extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
-- 
2.12.0.264.gd6db3f2165.dirty

#84Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#32)
Re: snapbuild woes

On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:

From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

A bit more commentary would be good. What does that protect us against?

I think I explained that in the email. We might export snapshot with
xmin smaller than global xmin otherwise.

Updated commit message with explanation as well.

From ae60b52ae0ca96bc14169cd507f101fbb5dfdf52 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sun, 26 Feb 2017 01:07:33 +0100
Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

Logical decoding snapshot builder may encounter xl_running_xacts with
older xmin than the xmin of the builder. This can happen because
LogStandbySnapshot() sometimes sees already committed transactions as
running (there is difference between "running" in terms for WAL and in
terms of ProcArray). When this happens we must make sure that the xmin
of snapshot builder won't go back otherwise the resulting snapshot would
show some transaction as running even though they have already
committed.
---
src/backend/replication/logical/snapbuild.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ada618d..3e34f75 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1165,7 +1165,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
* looking, it's correct and actually more efficient this way since we hit
* fast paths in tqual.c.
*/
-	builder->xmin = running->oldestRunningXid;
+	if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin))
+		builder->xmin = running->oldestRunningXid;

/* Remove transactions we don't need to keep track off anymore */
SnapBuildPurgeCommittedTxn(builder);
--
2.7.4

I still don't understand. The snapshot's xmin is solely managed via
xl_running_xacts, so I don't see how the WAL/procarray difference can
play a role here. ->committed isn't pruned before xl_running_xacts
indicates it can be done, so I don't understand your explanation above.

I'd be ok with adding this as a paranoia check, but I still don't
understand when it could practically be hit.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#85Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#83)
1 attachment(s)
Re: snapbuild woes

On 2017-05-11 14:54:26 -0700, Andres Freund wrote:

On 2017-05-11 14:51:55 -0700, Andres Freund wrote:

On 2017-05-08 00:10:12 -0700, Andres Freund wrote:

I plan to commit the next pending patch after the back branch releases
are cut - it's an invasive fix and the issue doesn't cause corruption
"just" slow slot creation. So it seems better to wait for a few days,
rather than hurry it into the release.

Now that that's done, here's an updated version of that patch. Note the
new logic to trigger xl_running_xact's to be logged at the right spot.
Works well in my testing.

I plan to commit this fairly soon, unless somebody wants a bit more time
to look into it.

Unless somebody protests, I'd like to slightly revise how the on-disk
snapshots are stored on master - given the issues this bug/commit showed
with the current method - but I can see one could argue that that should
rather be done next release.

As usual I forgot my attachment.

This patch also seems to offer a way to do your 0005 in a possibly more
efficient manner. We don't ever need to assume a transaction is
timetravelling anymore. Could you check whether the attached, hasty,
patch resolves the performance problems you measured? Also, do you have
a "reference" workload for that?
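
Something along these lines is roughly the shape of workload that should
show the problem, going by Petr's description earlier in the thread (one
long-lived transaction holding back consistency while a busy server churns
through xids during slot creation). Database names, client counts and
durations below are placeholders, not a tested recipe:

# keep one transaction with an assigned xid open for ten minutes
psql -c 'begin; select txid_current(); select pg_sleep(600);' &
# generate plenty of concurrent transactions meanwhile (assumes pgbench -i was run)
pgbench -n -c 16 -j 4 -T 600 &
# time how long it takes slot creation to reach a consistent snapshot
time psql -c "select pg_create_logical_replication_slot('test_slot', 'test_decoding')"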

Regards,

Andres

Attachments:

another-approach.difftext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0f2dcb84be..4ddd10fcf0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -929,21 +929,31 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 {
 	int			nxact;
 
-	bool		forced_timetravel = false;
+	bool		needs_snapshot = false;
+	bool		needs_timetravel = false;
+
 	bool		sub_needs_timetravel = false;
-	bool		top_needs_timetravel = false;
 
 	TransactionId xmax = xid;
 
+	if (builder->state == SNAPBUILD_START)
+		return;
+
+
 	/*
-	 * If we couldn't observe every change of a transaction because it was
-	 * already running at the point we started to observe we have to assume it
-	 * made catalog changes.
-	 *
-	 * This has the positive benefit that we afterwards have enough
-	 * information to build an exportable snapshot that's usable by pg_dump et
-	 * al.
+	 * Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
+	 * will it be part of a snapshot.  So we don't even need to record
+	 * anything.
 	 */
+	if (builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
+		TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder)))
+	{
+		/* ensure that only commits after this are getting replayed */
+		if (builder->start_decoding_at <= lsn)
+			builder->start_decoding_at = lsn + 1;
+		return;
+	}
+
 	if (builder->state < SNAPBUILD_CONSISTENT)
 	{
 		/* ensure that only commits after this are getting replayed */
@@ -951,12 +961,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			builder->start_decoding_at = lsn + 1;
 
 		/*
-		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
-		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * If we're building an exportable snapshot, force recording of the
+		 * xid, even if the transaction doesn't modify the catalog.
 		 */
-		forced_timetravel = true;
-		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+		if (builder->building_full_snapshot)
+		{
+			needs_timetravel = true;
+		}
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -964,23 +975,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		TransactionId subxid = subxacts[nxact];
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
+			needs_snapshot = true;
 
 			elog(DEBUG1, "found subtransaction %u:%u with catalog changes.",
 				 xid, subxid);
@@ -990,21 +991,26 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state, even
+		 * if not catalog modifying.
+		 */
+		else if (needs_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
-	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	/* if top-level modifies catalog, it'll need a snapshot */
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
-
-		top_needs_timetravel = true;
+		needs_snapshot = true;
+		needs_timetravel = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 	else if (sub_needs_timetravel)
@@ -1012,23 +1018,38 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (needs_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
+
+	if (!needs_timetravel)
+	{
+		/* record that we cannot export a general snapshot anymore */
+		builder->committed.includes_all_transactions = false;
+	}
+
+	Assert(!needs_snapshot || needs_timetravel);
+
+	/*
+	 * Adjust xmax of the snapshot builder, we only do that for committed,
+	 * catalog modifying, transactions, everything else isn't interesting
+	 * for us since we'll never look at the respective rows.
+	 */
+	if (needs_timetravel &&
+		(!TransactionIdIsValid(builder->xmax) ||
+		 TransactionIdFollowsOrEquals(xmax, builder->xmax)))
+	{
+		builder->xmax = xmax;
+		TransactionIdAdvance(builder->xmax);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
-	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
+	if (needs_snapshot)
 	{
 		/*
-		 * Adjust xmax of the snapshot builder, we only do that for committed,
-		 * catalog modifying, transactions, everything else isn't interesting
-		 * for us since we'll never look at the respective rows.
-		 */
-		if (!TransactionIdIsValid(builder->xmax) ||
-			TransactionIdFollowsOrEquals(xmax, builder->xmax))
-		{
-			builder->xmax = xmax;
-			TransactionIdAdvance(builder->xmax);
-		}
-
-		/*
 		 * If we haven't built a complete snapshot yet there's no need to hand
 		 * it out, it wouldn't (and couldn't) be used anyway.
 		 */
@@ -1059,11 +1080,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* add a new Snapshot to all currently running transactions */
 		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
-	else
-	{
-		/* record that we cannot export a general snapshot anymore */
-		builder->committed.includes_all_transactions = false;
-	}
 }
 
 
#86Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#85)
Re: snapbuild woes

On 12/05/17 03:31, Andres Freund wrote:

On 2017-05-11 14:54:26 -0700, Andres Freund wrote:

On 2017-05-11 14:51:55 -0700, Andres Freund wrote:

On 2017-05-08 00:10:12 -0700, Andres Freund wrote:

I plan to commit the next pending patch after the back branch releases
are cut - it's an invasive fix and the issue doesn't cause corruption,
"just" slow slot creation. So it seems better to wait for a few days,
rather than hurry it into the release.

Now that that's done, here's an updated version of that patch. Note the
new logic to trigger xl_running_xacts records to be logged at the right spot.
Works well in my testing.

I plan to commit this fairly soon, unless somebody wants a bit more time
to look into it.

Unless somebody protests, I'd like to slightly revise how the on-disk
snapshots are stored on master - given the issues this bug/commit showed
with the current method - but I can see one could argue that that should
rather be done next release.

As usual I forgot my attachment.

This patch also seems to offer a way to do your 0005 in a possibly more
efficient manner. We don't ever need to assume a transaction is
timetravelling anymore. Could you check whether the attached, hasty,
patch resolves the performance problems you measured? Also, do you have
a "reference" workload for that?

Hmm, well it helps but actually now that we don't track individual
running transactions anymore it got much less effective (my version of
0005 does as well).

The example workload I test with is:
session 1: open transaction, do a write, keep it open
session 2: pgbench -M simple -N -c 10 -P 1 -T 5
session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
session 2: pgbench -M simple -N -c 10 -P 1 -T 20
session 1: commit

And wait for session 3 to finish slot creation, takes about 20 mins on
my laptop without patches, minute and half with your patches for 0004
and 0005 (or with your 0004 and my 0005) and about 2s with my original
0004 and 0005.

What makes it slow is the constant qsorting IIRC.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#87Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#86)
Re: snapbuild woes

On 12/05/17 10:57, Petr Jelinek wrote:

On 12/05/17 03:31, Andres Freund wrote:

On 2017-05-11 14:54:26 -0700, Andres Freund wrote:

On 2017-05-11 14:51:55 -0700, Andres Freund wrote:

On 2017-05-08 00:10:12 -0700, Andres Freund wrote:

I plan to commit the next pending patch after the back branch releases
are cut - it's an invasive fix and the issue doesn't cause corruption,
"just" slow slot creation. So it seems better to wait for a few days,
rather than hurry it into the release.

Now that that's done, here's an updated version of that patch. Note the
new logic to trigger xl_running_xacts records to be logged at the right spot.
Works well in my testing.

I plan to commit this fairly soon, unless somebody wants a bit more time
to look into it.

Unless somebody protests, I'd like to slightly revise how the on-disk
snapshots are stored on master - given the issues this bug/commit showed
with the current method - but I can see one could argue that that should
rather be done next release.

As usual I forgot my attachment.

This patch also seems to offer a way to do your 0005 in a possibly more
efficient manner. We don't ever need to assume a transaction is
timetravelling anymore. Could you check whether the attached, hasty,
patch resolves the performance problems you measured? Also, do you have
a "reference" workload for that?

Hmm, well it helps but actually now that we don't track individual
running transactions anymore it got much less effective (my version of
0005 does as well).

The example workload I test with is:
session 1: open transaction, do a write, keep it open
session 2: pgbench -M simple -N -c 10 -P 1 -T 5
session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
session 2: pgbench -M simple -N -c 10 -P 1 -T 20
session 1: commit

And wait for session 3 to finish slot creation, takes about 20 mins on
my laptop without patches, minute and half with your patches for 0004
and 0005 (or with your 0004 and my 0005) and about 2s with my original
0004 and 0005.

What makes it slow is the constant qsorting IIRC.

Btw I still think that we should pursue the approach you proposed. I
think we can deal with the qsorting in some other ways (ordered insert
perhaps?) later. What you propose fixes the correctness issue, makes
performance less awful in the above workload, and also makes the
snapbuild code a bit easier to read.
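
For illustration, the "ordered insert" idea could look roughly like the
sketch below, written against the snapbuild.c structures; growing
committed.xip when it is full and any wraparound subtleties are elided,
so this is only a sketch, not a worked-out patch:

	/*
	 * Sketch only: keep builder->committed.xip sorted on insert so that
	 * building a snapshot no longer has to qsort() the array each time.
	 * Plain uint32 ordering is used, matching today's xidComparator.
	 */
	static void
	SnapBuildAddCommittedTxnSorted(SnapBuild *builder, TransactionId xid)
	{
		size_t		lo = 0;
		size_t		hi = builder->committed.xcnt;

		/* binary search for the insertion point */
		while (lo < hi)
		{
			size_t		mid = lo + (hi - lo) / 2;

			if (builder->committed.xip[mid] < xid)
				lo = mid + 1;
			else
				hi = mid;
		}

		/* shift the tail up by one slot and insert */
		memmove(&builder->committed.xip[lo + 1],
				&builder->committed.xip[lo],
				(builder->committed.xcnt - lo) * sizeof(TransactionId));
		builder->committed.xip[lo] = xid;
		builder->committed.xcnt++;
	}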

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#88Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#86)
Re: snapbuild woes

On 2017-05-12 10:57:55 +0200, Petr Jelinek wrote:

Hmm, well it helps but actually now that we don't track individual
running transactions anymore it got much less effective (my version of
0005 does as well).

The example workload I test with is:
session 1: open transaction, do a write, keep it open
session 2: pgbench -M simple -N -c 10 -P 1 -T 5
session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
session 2: pgbench -M simple -N -c 10 -P 1 -T 20
session 1: commit

And wait for session 3 to finish slot creation, takes about 20 mins on
my laptop without patches, minute and half with your patches for 0004
and 0005 (or with your 0004 and my 0005) and about 2s with my original
0004 and 0005.

Is that with assertions enabled or not? With assertions enabled, all
the time post-patches seems to be spent in some Asserts in
reorderbuffer.c; without them it takes less than a second for me here.

I'm applying these now.

Greetings,

Andres Freund


#89Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Andres Freund (#88)
Re: snapbuild woes

On 13/05/17 22:23, Andres Freund wrote:

On 2017-05-12 10:57:55 +0200, Petr Jelinek wrote:

Hmm, well it helps but actually now that we don't track individual
running transactions anymore it got much less effective (my version of
0005 does as well).

The example workload I test with is:
session 1: open transaction, do a write, keep it open
session 2: pgbench -M simple -N -c 10 -P 1 -T 5
session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
session 2: pgbench -M simple -N -c 10 -P 1 -T 20
session 1: commit

And wait for session 3 to finish slot creation, takes about 20 mins on
my laptop without patches, minute and half with your patches for 0004
and 0005 (or with your 0004 and my 0005) and about 2s with my original
0004 and 0005.

Is that with assertions enabled or not? With assertions enabled, all
the time post-patches seems to be spent in some Asserts in
reorderbuffer.c; without them it takes less than a second for me here.

Right you are, I always forget to switch that off before benchmarks.

I'm applying these now.

Cool. Just for posterity, this also fixes issue number 3, as the
switch to consistent state is done purely based on xl_running_xacts and
not on decoded commits/aborts.

So all done here, thanks!

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#90Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#89)
Re: snapbuild woes

On 2017-05-14 14:54:37 +0200, Petr Jelinek wrote:

On 13/05/17 22:23, Andres Freund wrote:

And wait for session 3 to finish slot creation, takes about 20 mins on
my laptop without patches, minute and half with your patches for 0004
and 0005 (or with your 0004 and my 0005) and about 2s with my original
0004 and 0005.

Is that with assertions enabled or not? With assertions enabled, all
the time post-patches seems to be spent in some Asserts in
reorderbuffer.c; without them it takes less than a second for me here.

Right you are, I always forget to switch that off before benchmarks.

Phew ;)

I'm applying these now.

Cool. Just for posterity, this also fixes issue number 3, as the
switch to consistent state is done purely based on xl_running_xacts and
not on decoded commits/aborts.

Cool. Although I'm still not convinced, as noted somewhere in this
thread, that it actually did much to start with ;)

Greetings,

Andres Freund


#91Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#39)
Re: snapbuild woes

Andres Freund <andres@anarazel.de> wrote:

Looking at 0001:
- GetOldestSafeDecodingTransactionId() only guarantees to return an xid
safe for decoding (note how procArray->replication_slot_catalog_xmin
is checked), not one for the initial snapshot - so afaics this whole
exercise doesn't guarantee much so far.

I happen to use CreateInitDecodingContext() in an extension, so I had
to think about what the new argument exactly means (as for the
incompatibility between PG 9.6.2 and 9.6.3, I suppose preprocessor
directives can handle it).

One thing I'm failing to understand is: if TRUE is passed for
need_full_snapshot, then slot->effective_xmin receives the result of

GetOldestSafeDecodingTransactionId(need_full_snapshot)

but this does include "catalog xmin".

If the value returned is determined by an existing slot which has valid
effective_catalog_xmin and invalid effective_xmin (i.e. that slot only
protects catalog tables from VACUUM but not the regular ones), then the new
slot will think it (i.e. the new slot) protects even non-catalog tables, but
that's not true.

Shouldn't the xmin_horizon be computed by this call instead?

GetOldestSafeDecodingTransactionId(!need_full_snapshot);
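
In code terms, what I have in mind is roughly the sketch below of the
relevant part of CreateInitDecodingContext() (locking and the rest of
the function elided; xmin_horizon just mirrors the naming above):

	/*
	 * Sketch only: when the caller needs a full snapshot, the horizon
	 * must also be safe for non-catalog tables, so catalog-only horizons
	 * must not be considered.
	 */
	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);

	slot->effective_catalog_xmin = xmin_horizon;
	slot->data.catalog_xmin = xmin_horizon;
	if (need_full_snapshot)
		slot->effective_xmin = xmin_horizon;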

(If so, I think "considerCatalog" argument name would be clearer than
"catalogOnly".)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


#92Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#91)
Re: snapbuild woes

On 2017-06-09 09:25:34 +0200, Antonin Houska wrote:

Andres Freund <andres@anarazel.de> wrote:

Looking at 0001:
- GetOldestSafeDecodingTransactionId() only guarantees to return an xid
safe for decoding (note how procArray->replication_slot_catalog_xmin
is checked), not one for the initial snapshot - so afaics this whole
exercise doesn't guarantee much so far.

I happen to use CreateInitDecodingContext() in an extension, so I had
to think about what the new argument exactly means (as for the
incompatibility between PG 9.6.2 and 9.6.3, I suppose preprocessor
directives can handle it).

One thing I'm failing to understand is: if TRUE is passed for
need_full_snapshot, then slot->effective_xmin receives the result of

GetOldestSafeDecodingTransactionId(need_full_snapshot)

but this does include "catalog xmin".

If the value returned is determined by an existing slot which has valid
effective_catalog_xmin and invalid effective_xmin (i.e. that slot only
protects catalog tables from VACUUM but not the regular ones), then the new
slot will think it (i.e. the new slot) protects even non-catalog tables, but
that's not true.

Shouldn't the xmin_horizon be computed by this call instead?

GetOldestSafeDecodingTransactionId(!need_full_snapshot);

(If so, I think "considerCatalog" argument name would be clearer than
"catalogOnly".)

Good catch. Pushing a fix. Afaict it's luckily inconsequential atm
because of the way we wait for concurrent snapshots when creating a
slot. But it obviously nevertheless needs to be fixed.

Greetings,

Andres Freund
