SimpleLruTruncate() mutual exclusion
I'm forking this thread from
/messages/by-id/flat/20190202083822.GC32531@gust.leadboat.com, which
reported a race condition involving the "apparent wraparound" safety check in
SimpleLruTruncate():
On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
1. The result of the test is valid only until we release the SLRU ControlLock,
which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
segments for deletion. Once we release that lock, latest_page_number can
advance. This creates a TOCTOU race condition, allowing excess deletion:
[local] test=# table trunc_clog_concurrency ;
ERROR: could not access status of transaction 2149484247
DETAIL: Could not open file "pg_xact/0801": No such file or directory.
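For reference, the shape of that window in SimpleLruTruncate() is roughly this
(a condensed sketch of the existing logic, not a verbatim excerpt):

	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
	...
	/* "apparent wraparound" check, valid only while the lock is held */
	if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
	{
		LWLockRelease(shared->ControlLock);
		ereport(LOG,
				(errmsg("could not truncate directory \"%s\": apparent wraparound",
						ctl->Dir)));
		return;
	}
	...
	LWLockRelease(shared->ControlLock);
	/* window: latest_page_number may now advance past cutoffPage */
	(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);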
Fixes are available:
b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
checkpoint, or in the startup process. Hence, also arrange for only one
backend to call SimpleLruTruncate(AsyncCtl) at a time.
More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
backend per cluster. This, attached, was rather straightforward.
I wonder about performance in a database with millions of small relations,
particularly considering my intent to back-patch this. In such databases,
vac_update_datfrozenxid() can be a major part of the VACUUM's cost. Two
things work in our favor. First, vac_update_datfrozenxid() runs once per
VACUUM command, not once per relation. Second, Autovacuum has this logic:
* ... we skip
* this if (1) we found no work to do and (2) we skipped at least one
* table due to concurrent autovacuum activity. In that case, the other
* worker has already done it, or will do so when it finishes.
*/
if (did_vacuum || !found_concurrent_worker)
vac_update_datfrozenxid();
That makes me relatively unworried. I did consider some alternatives:
- Run vac_update_datfrozenxid()'s pg_class scan before taking a lock. If we
find the need for pg_database updates, take the lock and scan pg_class again
to get final numbers. This doubles the work in cases that end up taking the
lock, so I'm not betting on it being a net win.
- Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
backend is already waiting. This is similar to Autovacuum's
found_concurrent_worker test. It is tempting. I'm not proposing it,
because it changes the states possible when manual VACUUM completes. Today,
you can predict post-VACUUM datfrozenxid from post-VACUUM relfrozenxid
values. If manual VACUUM could skip vac_update_datfrozenxid() this way,
datfrozenxid could lag until some concurrent VACUUM finishes.
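For concreteness, that rejected alternative would have looked something like
this (a sketch only, reusing the lock tag this patch introduces):

	LOCKTAG		tag;

	SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
	if (LockWaiterCount(&tag) > 0)
		return;					/* some waiter will recompute datfrozenxid */
	LockDatabaseFrozenIds(ExclusiveLock);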
Thanks,
nm
Attachments:
wrap-limits-mutex-v1.patch (text/plain; charset=us-ascii)
commit 65928b9 (HEAD)
Author: Noah Misch <noah@leadboat.com>
AuthorDate: Sun Feb 17 23:26:45 2019 -0800
Commit: Noah Misch <noah@leadboat.com>
CommitDate: Sun Feb 17 23:26:45 2019 -0800
Prevent concurrent SimpleLruTruncate() for any given SLRU.
The SimpleLruTruncate() header comment states the new coding rule. This
closes race conditions that presented rare opportunities for data loss.
Symptoms and recommendations to users mirror the previous commit[1].
Back-patch to 9.4 (all supported versions).
Reviewed by TBD.
Discussion: https://postgr.es/m/TBD
[1] slru-truncate-modulo-v2.patch of https://postgr.es/m/20190217040913.GA1305059@rfd.leadboat.com
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5c1408b..b7ddc73 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -849,7 +849,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<tbody>
<row>
- <entry morerows="63"><literal>LWLock</literal></entry>
+ <entry morerows="65"><literal>LWLock</literal></entry>
<entry><literal>ShmemIndexLock</literal></entry>
<entry>Waiting to find or allocate space in shared memory.</entry>
</row>
@@ -1044,6 +1044,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
the oldest transaction id available to it.</entry>
</row>
<row>
+ <entry><literal>WrapLimitsVacuumLock</literal></entry>
+ <entry>Waiting to update limits on transaction id and multixact
+ consumption.</entry>
+ </row>
+ <row>
+ <entry><literal>AsyncQueueTailLock</literal></entry>
+ <entry>Waiting to update limit on notification message
+ storage.</entry>
+ </row>
+ <row>
<entry><literal>clog</literal></entry>
<entry>Waiting for I/O on a clog (transaction status) buffer.</entry>
</row>
@@ -1128,7 +1138,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
counters during Parallel Hash plan execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="10"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
@@ -1137,6 +1147,12 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to extend a relation.</entry>
</row>
<row>
+ <entry><literal>frozen IDs</literal></entry>
+ <entry>Waiting to
+ update <structname>pg_database</structname>.<structfield>datfrozenxid</structfield>
+ and <structname>pg_database</structname>.<structfield>datminmxid</structfield>.</entry>
+ </row>
+ <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352..8c9d5d8 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1163,6 +1163,14 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
/*
* Remove all segments before the one holding the passed page number
+ *
+ * All SLRUs prevent concurrent calls to this function, either with an LWLock
+ * or by calling it only as part of a checkpoint. Mutual exclusion must begin
+ * before computing cutoffPage. Mutual exclusion must end after any limit
+ * update that would permit other backends to write fresh data into the
+ * segment immediately preceding the one containing cutoffPage. Otherwise,
+ * when the SLRU is quite full, SimpleLruTruncate() might delete that segment
+ * after it has accrued freshly-written data.
*/
void
SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index cbc6129..a480dd5 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -347,8 +347,8 @@ ExtendSUBTRANS(TransactionId newestXact)
/*
* Remove all SUBTRANS segments before the one holding the passed transaction ID
*
- * This is normally called during checkpoint, with oldestXact being the
- * oldest TransactionXmin of any running transaction.
+ * oldestXact is the oldest TransactionXmin of any running transaction. This
+ * is called only during checkpoint.
*/
void
TruncateSUBTRANS(TransactionId oldestXact)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 5a7ee0d..66e888c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -223,19 +223,22 @@ typedef struct QueueBackendStatus
/*
* Shared memory state for LISTEN/NOTIFY (excluding its SLRU stuff)
*
- * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+ * The AsyncQueueControl structure is protected by the AsyncQueueLock and
+ * AsyncQueueTailLock.
*
- * When holding the lock in SHARED mode, backends may only inspect their own
- * entries as well as the head and tail pointers. Consequently we can allow a
- * backend to update its own record while holding only SHARED lock (since no
- * other backend will inspect it).
+ * When holding AsyncQueueLock in SHARED mode, backends may only inspect their
+ * own entries as well as the head and tail pointers. Consequently we can
+ * allow a backend to update its own record while holding only SHARED lock
+ * (since no other backend will inspect it).
*
- * When holding the lock in EXCLUSIVE mode, backends can inspect the entries
- * of other backends and also change the head and tail pointers.
+ * When holding AsyncQueueLock in EXCLUSIVE mode, backends can inspect the
+ * entries of other backends and also change the head pointer. When holding
+ * both AsyncQueueLock and AsyncQueueTailLock in EXCLUSIVE mode, backends can
+ * change the tail pointer.
*
* AsyncCtlLock is used as the control lock for the pg_notify SLRU buffers.
- * In order to avoid deadlocks, whenever we need both locks, we always first
- * get AsyncQueueLock and then AsyncCtlLock.
+ * In order to avoid deadlocks, whenever we need multiple locks, we first get
+ * AsyncQueueTailLock, then AsyncQueueLock, and lastly AsyncCtlLock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -2010,6 +2013,10 @@ asyncQueueAdvanceTail(void)
int newtailpage;
int boundary;
+ /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+ LWLockAcquire(AsyncQueueTailLock, LW_EXCLUSIVE);
+
+ /* Compute the new tail. */
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = QUEUE_HEAD;
for (i = 1; i <= MaxBackends; i++)
@@ -2018,7 +2025,6 @@ asyncQueueAdvanceTail(void)
min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
}
oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
- QUEUE_TAIL = min;
LWLockRelease(AsyncQueueLock);
/*
@@ -2038,6 +2044,17 @@ asyncQueueAdvanceTail(void)
*/
SimpleLruTruncate(AsyncCtl, newtailpage);
}
+
+ /*
+ * Advertise the new tail. This changes asyncQueueIsFull()'s verdict for
+ * the segment immediately prior to the new tail, allowing fresh data into
+ * that segment.
+ */
+ LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ QUEUE_TAIL = min;
+ LWLockRelease(AsyncQueueLock);
+
+ LWLockRelease(AsyncQueueTailLock);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e91df21..326f48b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1225,6 +1225,14 @@ vac_update_datfrozenxid(void)
bool dirty = false;
/*
+ * Restrict this task to one backend per database. This avoids race
+ * conditions that would move datfrozenxid or datminmxid backward or call
+ * vac_truncate_clog() with a datfrozenxid preceding a datfrozenxid passed
+ * to an earlier vac_truncate_clog() call.
+ */
+ LockDatabaseFrozenIds(ExclusiveLock);
+
+ /*
* Initialize the "min" calculation with GetOldestXmin, which is a
* reasonable approximation to the minimum relfrozenxid for not-yet-
* committed pg_class entries for new tables; see AddNewRelationTuple().
@@ -1388,6 +1396,9 @@ vac_truncate_clog(TransactionId frozenXID,
bool bogus = false;
bool frozenAlreadyWrapped = false;
+ /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+ LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);
+
/* init oldest datoids to sync with my frozenXID/minMulti values */
oldestxid_datoid = MyDatabaseId;
minmulti_datoid = MyDatabaseId;
@@ -1497,6 +1508,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
SetTransactionIdLimit(frozenXID, oldestxid_datoid);
SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+ LWLockRelease(WrapLimitsVacuumLock);
}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index e688ba8..4873835 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -458,6 +458,21 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
}
/*
+ * LockDatabaseFrozenIds
+ *
+ * This allows one backend per database to execute vac_update_datfrozenxid().
+ */
+void
+LockDatabaseFrozenIds(LOCKMODE lockmode)
+{
+ LOCKTAG tag;
+
+ SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
+
+ (void) LockAcquire(&tag, lockmode, false, false);
+}
+
+/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -1060,6 +1075,11 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
+ case LOCKTAG_DATABASE_FROZEN_IDS:
+ appendStringInfo(buf,
+ _("pg_database.datfrozenxid of database %u"),
+ tag->locktag_field1);
+ break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843..d517023 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,5 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
CLogTruncationLock 44
+WrapLimitsVacuumLock 45
+AsyncQueueTailLock 46
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index ffd1970..cbf0cab 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -26,6 +26,7 @@
const char *const LockTagTypeNames[] = {
"relation",
"extend",
+ "frozen IDs",
"page",
"tuple",
"transactionid",
@@ -245,6 +246,17 @@ pg_lock_status(PG_FUNCTION_ARGS)
nulls[8] = true;
nulls[9] = true;
break;
+ case LOCKTAG_DATABASE_FROZEN_IDS:
+ values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ break;
case LOCKTAG_PAGE:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 3d705fa..f113f55 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -59,6 +59,9 @@ extern bool ConditionalLockRelationForExtension(Relation relation,
LOCKMODE lockmode);
extern int RelationExtensionLockWaiterCount(Relation relation);
+/* Lock to recompute pg_database.datfrozenxid in the current database */
+extern void LockDatabaseFrozenIds(LOCKMODE lockmode);
+
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 16b927c..f3b6ef4 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -141,6 +141,8 @@ typedef enum LockTagType
/* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
+ LOCKTAG_DATABASE_FROZEN_IDS, /* pg_database.datfrozenxid */
+ /* ID info for frozen IDs is DB OID */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
LOCKTAG_TUPLE, /* one physical tuple */
@@ -206,6 +208,14 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+#define SET_LOCKTAG_DATABASE_FROZEN_IDS(locktag,dboid) \
+ ((locktag).locktag_field1 = (dboid), \
+ (locktag).locktag_field2 = 0, \
+ (locktag).locktag_field3 = 0, \
+ (locktag).locktag_field4 = 0, \
+ (locktag).locktag_type = LOCKTAG_DATABASE_FROZEN_IDS, \
+ (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
I'm forking this thread from
/messages/by-id/flat/20190202083822.GC32531@gust.leadboat.com, which
reported a race condition involving the "apparent wraparound" safety check in
SimpleLruTruncate():
On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
1. The result of the test is valid only until we release the SLRU ControlLock,
which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
segments for deletion. Once we release that lock, latest_page_number can
advance. This creates a TOCTOU race condition, allowing excess deletion:
[local] test=# table trunc_clog_concurrency ;
ERROR: could not access status of transaction 2149484247
DETAIL: Could not open file "pg_xact/0801": No such file or directory.
Fixes are available:
b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
checkpoint, or in the startup process. Hence, also arrange for only one
backend to call SimpleLruTruncate(AsyncCtl) at a time.
More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
backend per cluster. This, attached, was rather straightforward.
Rebased. The conflicts were limited to comments and documentation.
I wonder about performance in a database with millions of small relations,
particularly considering my intent to back-patch this. In such databases,
vac_update_datfrozenxid() can be a major part of the VACUUM's cost. Two
things work in our favor. First, vac_update_datfrozenxid() runs once per
VACUUM command, not once per relation. Second, Autovacuum has this logic:
* ... we skip
* this if (1) we found no work to do and (2) we skipped at least one
* table due to concurrent autovacuum activity. In that case, the other
* worker has already done it, or will do so when it finishes.
*/
if (did_vacuum || !found_concurrent_worker)
vac_update_datfrozenxid();
That makes me relatively unworried. I did consider some alternatives:
- Run vac_update_datfrozenxid()'s pg_class scan before taking a lock. If we
find the need for pg_database updates, take the lock and scan pg_class again
to get final numbers. This doubles the work in cases that end up taking the
lock, so I'm not betting on it being a net win.
- Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
backend is already waiting. This is similar to Autovacuum's
found_concurrent_worker test. It is tempting. I'm not proposing it,
because it changes the states possible when manual VACUUM completes. Today,
you can predict post-VACUUM datfrozenxid from post-VACUUM relfrozenxid
values. If manual VACUUM could skip vac_update_datfrozenxid() this way,
datfrozenxid could lag until some concurrent VACUUM finishes.
Attachments:
wrap-limits-mutex-v2.patch (text/plain; charset=us-ascii)
commit 647b5e3 (HEAD)
Author: Noah Misch <noah@leadboat.com>
AuthorDate: Fri Jun 28 10:00:53 2019 -0700
Commit: Noah Misch <noah@leadboat.com>
CommitDate: Fri Jun 28 10:00:53 2019 -0700
Prevent concurrent SimpleLruTruncate() for any given SLRU.
The SimpleLruTruncate() header comment states the new coding rule. This
closes race conditions that presented rare opportunities for data loss.
Symptoms and recommendations to users mirror the previous commit[1].
Back-patch to 9.4 (all supported versions).
Reviewed by TBD.
Discussion: https://postgr.es/m/TBD
[1] slru-truncate-modulo-v2.patch of https://postgr.es/m/20190217040913.GA1305059@rfd.leadboat.com
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c..dc82008 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -885,7 +885,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<tbody>
<row>
- <entry morerows="64"><literal>LWLock</literal></entry>
+ <entry morerows="66"><literal>LWLock</literal></entry>
<entry><literal>ShmemIndexLock</literal></entry>
<entry>Waiting to find or allocate space in shared memory.</entry>
</row>
@@ -1080,6 +1080,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
the oldest transaction id available to it.</entry>
</row>
<row>
+ <entry><literal>WrapLimitsVacuumLock</literal></entry>
+ <entry>Waiting to update limits on transaction id and multixact
+ consumption.</entry>
+ </row>
+ <row>
+ <entry><literal>AsyncQueueTailLock</literal></entry>
+ <entry>Waiting to update limit on notification message
+ storage.</entry>
+ </row>
+ <row>
<entry><literal>clog</literal></entry>
<entry>Waiting for I/O on a clog (transaction status) buffer.</entry>
</row>
@@ -1169,7 +1179,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
counters during Parallel Hash plan execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="10"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
@@ -1178,6 +1188,12 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to extend a relation.</entry>
</row>
<row>
+ <entry><literal>frozen IDs</literal></entry>
+ <entry>Waiting to
+ update <structname>pg_database</structname>.<structfield>datfrozenxid</structfield>
+ and <structname>pg_database</structname>.<structfield>datminmxid</structfield>.</entry>
+ </row>
+ <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 2dbc65b..a3094b4 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1169,6 +1169,14 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
/*
* Remove all segments before the one holding the passed page number
+ *
+ * All SLRUs prevent concurrent calls to this function, either with an LWLock
+ * or by calling it only as part of a checkpoint. Mutual exclusion must begin
+ * before computing cutoffPage. Mutual exclusion must end after any limit
+ * update that would permit other backends to write fresh data into the
+ * segment immediately preceding the one containing cutoffPage. Otherwise,
+ * when the SLRU is quite full, SimpleLruTruncate() might delete that segment
+ * after it has accrued freshly-written data.
*/
void
SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index e667fd0..8159427 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -349,8 +349,8 @@ ExtendSUBTRANS(TransactionId newestXact)
/*
* Remove all SUBTRANS segments before the one holding the passed transaction ID
*
- * This is normally called during checkpoint, with oldestXact being the
- * oldest TransactionXmin of any running transaction.
+ * oldestXact is the oldest TransactionXmin of any running transaction. This
+ * is called only during checkpoint.
*/
void
TruncateSUBTRANS(TransactionId oldestXact)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 738e6ec..365cdd9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -223,19 +223,22 @@ typedef struct QueueBackendStatus
/*
* Shared memory state for LISTEN/NOTIFY (excluding its SLRU stuff)
*
- * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+ * The AsyncQueueControl structure is protected by the AsyncQueueLock and
+ * AsyncQueueTailLock.
*
- * When holding the lock in SHARED mode, backends may only inspect their own
- * entries as well as the head and tail pointers. Consequently we can allow a
- * backend to update its own record while holding only SHARED lock (since no
- * other backend will inspect it).
+ * When holding AsyncQueueLock in SHARED mode, backends may only inspect their
+ * own entries as well as the head and tail pointers. Consequently we can
+ * allow a backend to update its own record while holding only SHARED lock
+ * (since no other backend will inspect it).
*
- * When holding the lock in EXCLUSIVE mode, backends can inspect the entries
- * of other backends and also change the head and tail pointers.
+ * When holding AsyncQueueLock in EXCLUSIVE mode, backends can inspect the
+ * entries of other backends and also change the head pointer. When holding
+ * both AsyncQueueLock and AsyncQueueTailLock in EXCLUSIVE mode, backends can
+ * change the tail pointer.
*
* AsyncCtlLock is used as the control lock for the pg_notify SLRU buffers.
- * In order to avoid deadlocks, whenever we need both locks, we always first
- * get AsyncQueueLock and then AsyncCtlLock.
+ * In order to avoid deadlocks, whenever we need multiple locks, we first get
+ * AsyncQueueTailLock, then AsyncQueueLock, and lastly AsyncCtlLock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -2010,6 +2013,10 @@ asyncQueueAdvanceTail(void)
int newtailpage;
int boundary;
+ /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+ LWLockAcquire(AsyncQueueTailLock, LW_EXCLUSIVE);
+
+ /* Compute the new tail. */
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = QUEUE_HEAD;
for (i = 1; i <= MaxBackends; i++)
@@ -2018,7 +2025,6 @@ asyncQueueAdvanceTail(void)
min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
}
oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
- QUEUE_TAIL = min;
LWLockRelease(AsyncQueueLock);
/*
@@ -2038,6 +2044,17 @@ asyncQueueAdvanceTail(void)
*/
SimpleLruTruncate(AsyncCtl, newtailpage);
}
+
+ /*
+ * Advertise the new tail. This changes asyncQueueIsFull()'s verdict for
+ * the segment immediately prior to the new tail, allowing fresh data into
+ * that segment.
+ */
+ LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ QUEUE_TAIL = min;
+ LWLockRelease(AsyncQueueLock);
+
+ LWLockRelease(AsyncQueueTailLock);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e7b379d..c81d371 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1281,6 +1281,14 @@ vac_update_datfrozenxid(void)
bool dirty = false;
/*
+ * Restrict this task to one backend per database. This avoids race
+ * conditions that would move datfrozenxid or datminmxid backward or call
+ * vac_truncate_clog() with a datfrozenxid preceding a datfrozenxid passed
+ * to an earlier vac_truncate_clog() call.
+ */
+ LockDatabaseFrozenIds(ExclusiveLock);
+
+ /*
* Initialize the "min" calculation with GetOldestXmin, which is a
* reasonable approximation to the minimum relfrozenxid for not-yet-
* committed pg_class entries for new tables; see AddNewRelationTuple().
@@ -1469,6 +1477,9 @@ vac_truncate_clog(TransactionId frozenXID,
bool bogus = false;
bool frozenAlreadyWrapped = false;
+ /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+ LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);
+
/* init oldest datoids to sync with my frozenXID/minMulti values */
oldestxid_datoid = MyDatabaseId;
minmulti_datoid = MyDatabaseId;
@@ -1578,6 +1589,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
SetTransactionIdLimit(frozenXID, oldestxid_datoid);
SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+ LWLockRelease(WrapLimitsVacuumLock);
}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index f838b0f..1b898f0 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -461,6 +461,21 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
}
/*
+ * LockDatabaseFrozenIds
+ *
+ * This allows one backend per database to execute vac_update_datfrozenxid().
+ */
+void
+LockDatabaseFrozenIds(LOCKMODE lockmode)
+{
+ LOCKTAG tag;
+
+ SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
+
+ (void) LockAcquire(&tag, lockmode, false, false);
+}
+
+/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -1100,6 +1115,11 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
+ case LOCKTAG_DATABASE_FROZEN_IDS:
+ appendStringInfo(buf,
+ _("pg_database.datfrozenxid of database %u"),
+ tag->locktag_field1);
+ break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843..d517023 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,5 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
CLogTruncationLock 44
+WrapLimitsVacuumLock 45
+AsyncQueueTailLock 46
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index ffd1970..cbf0cab 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -26,6 +26,7 @@
const char *const LockTagTypeNames[] = {
"relation",
"extend",
+ "frozen IDs",
"page",
"tuple",
"transactionid",
@@ -245,6 +246,17 @@ pg_lock_status(PG_FUNCTION_ARGS)
nulls[8] = true;
nulls[9] = true;
break;
+ case LOCKTAG_DATABASE_FROZEN_IDS:
+ values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ break;
case LOCKTAG_PAGE:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 099e18f..3f42000 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -59,6 +59,9 @@ extern bool ConditionalLockRelationForExtension(Relation relation,
LOCKMODE lockmode);
extern int RelationExtensionLockWaiterCount(Relation relation);
+/* Lock to recompute pg_database.datfrozenxid in the current database */
+extern void LockDatabaseFrozenIds(LOCKMODE lockmode);
+
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 986bb64..5dc7f87 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -139,6 +139,7 @@ typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
+ LOCKTAG_DATABASE_FROZEN_IDS, /* pg_database.datfrozenxid */
LOCKTAG_PAGE, /* one page of a relation */
LOCKTAG_TUPLE, /* one physical tuple */
LOCKTAG_TRANSACTION, /* transaction (for waiting for xact done) */
@@ -195,6 +196,15 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+/* ID info for frozen IDs is DB OID */
+#define SET_LOCKTAG_DATABASE_FROZEN_IDS(locktag,dboid) \
+ ((locktag).locktag_field1 = (dboid), \
+ (locktag).locktag_field2 = 0, \
+ (locktag).locktag_field3 = 0, \
+ (locktag).locktag_field4 = 0, \
+ (locktag).locktag_type = LOCKTAG_DATABASE_FROZEN_IDS, \
+ (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+
/* ID info for a page is RELATION info + BlockNumber */
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
Noah Misch <noah@leadboat.com> writes:
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
checkpoint, or in the startup process. Hence, also arrange for only one
backend to call SimpleLruTruncate(AsyncCtl) at a time.
More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
backend per cluster. This, attached, was rather straightforward.
Rebased. The conflicts were limited to comments and documentation.
I tried to review this, along with your adjacent patch to adjust the
segment-roundoff logic, but I didn't get far. I see the point that
the roundoff might create problems when we are close to hitting
wraparound. I do not, however, see why serializing vac_truncate_clog
helps. I'm pretty sure it was intentional that multiple backends
could run truncation directory scans concurrently, and I don't really
want to give that up if possible.
Also, if I understand the data-loss hazard properly, it's what you
said in the other thread: the latest_page_number could advance after
we make our decision about what to truncate, and then maybe we could
truncate newly-added data. We surely don't want to lock out the
operations that can advance latest_page_number, so how does serializing
vac_truncate_clog help?
I wonder whether what we need to do is add some state in shared
memory saying "it is only safe to create pages before here", and
make SimpleLruZeroPage or its callers check that. The truncation
logic would advance that value, but only after completing a scan.
(Oh ..., hmm, maybe the point of serializing truncations is to
ensure that we know that we can safely advance that?)
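Roughly, I'm imagining something like this in SimpleLruZeroPage() (the field
name is invented for illustration; no such field exists today):

	/* hypothetical shared limit, advanced by the truncation logic */
	if (!ctl->PagePrecedes(pageno, shared->safe_create_limit))
		elog(ERROR, "cannot create SLRU page %d before truncation completes",
			 pageno);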
Can you post whatever script you used to detect/reproduce the problem
in the first place?
regards, tom lane
PS: also, re-reading this code, it's worrisome that we are not checking
for failure of the unlink calls. I think the idea was that it didn't
really matter because if we did re-use an existing file we'd just
re-zero it; hence failing the truncate is an overreaction. But probably
some comments about that are in order.
On Mon, Jul 29, 2019 at 12:58:17PM -0400, Tom Lane wrote:
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
checkpoint, or in the startup process. Hence, also arrange for only one
backend to call SimpleLruTruncate(AsyncCtl) at a time.
More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
backend per cluster. This, attached, was rather straightforward.
I'm pretty sure it was intentional that multiple backends
could run truncation directory scans concurrently, and I don't really
want to give that up if possible.
I want to avoid a design that requires painstaking concurrency analysis. Such
analysis is worth it for heap_update(), but vac_truncate_clog() isn't enough
of a hot path. If there's a way to make vac_truncate_clog() easy to analyze
and still somewhat concurrent, fair.
Also, if I understand the data-loss hazard properly, it's what you
said in the other thread: the latest_page_number could advance after
we make our decision about what to truncate, and then maybe we could
truncate newly-added data. We surely don't want to lock out the
operations that can advance latest_page_number, so how does serializing
vac_truncate_clog help?
I wonder whether what we need to do is add some state in shared
memory saying "it is only safe to create pages before here", and
make SimpleLruZeroPage or its callers check that. The truncation
logic would advance that value, but only after completing a scan.
(Oh ..., hmm, maybe the point of serializing truncations is to
ensure that we know that we can safely advance that?)
vac_truncate_clog() already records "it is only safe to create pages before
here" in ShmemVariableCache->xidWrapLimit, which it updates after any unlinks.
The trouble comes when two vac_truncate_clog() run in parallel and you get a
sequence of events like this:
vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 1 unlinks segment ABCD
vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
vac_truncate_clog() instance 1 finishes
some backend calls SimpleLruZeroPage(), creating segment ABCD
vac_truncate_clog() instance 2 unlinks segment ABCD
Serializing vac_truncate_clog() fixes that.
Can you post whatever script you used to detect/reproduce the problem
in the first place?
I've attached it, but it's pretty hackish. Apply the patch on commit 7170268,
make sure your --bindir is in $PATH, copy "conf-test-pg" to your home
directory, and run "make trunc_check". This incorporates xid-burn
acceleration code that Jeff Janes shared with -hackers at some point.
PS: also, re-reading this code, it's worrisome that we are not checking
for failure of the unlink calls. I think the idea was that it didn't
really matter because if we did re-use an existing file we'd just
re-zero it; hence failing the truncate is an overreaction. But probably
some comments about that are in order.
That's my understanding. We'll zero any page before reusing it. Failing the
whole vac_truncate_clog() (and therefore not advancing the wrap limit) would
do more harm than the bit of wasted disk space. Still, a LOG or WARNING
message would be fine, I think.
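Something minimal in SlruInternalDeleteSegment() would do, e.g. (a sketch, not
a finished patch):

	char		path[MAXPGPATH];

	snprintf(path, MAXPGPATH, "%s/%s", ctl->Dir, filename);
	if (unlink(path) != 0)
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not remove file \"%s\": %m", path)));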
Thanks,
nm
Attachments:
repro-vac_truncate_clog-race-v0.patch (text/plain; charset=us-ascii)
diff --git a/GNUmakefile.in b/GNUmakefile.in
index f4e31a7..284fba1 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -67,6 +67,20 @@ check check-tests installcheck installcheck-parallel installcheck-tests: CHECKPR
check check-tests installcheck installcheck-parallel installcheck-tests: submake-generated-headers
$(MAKE) -C src/test/regress $@
+TEST_PGDATA_DATA=/var/tmp/test-pg
+trunc_check:
+ make -j20 install PROFILE=-O0
+ $(CC) -g -Wall -I`pg_config --includedir` -L`pg_config --libdir` -Wl,-R`pg_config --libdir` trunc-clog-concurrency.c -lpq -o trunc-clog-concurrency-tester
+ pg_ctl -m fast -w stop ||:
+ rm -rf $(TEST_PGDATA_DATA)
+ initdb -N $(TEST_PGDATA_DATA)
+ cp -p ~/conf-test-pg $(TEST_PGDATA_DATA)/postgresql.conf
+ echo "default_transaction_isolation = 'read committed'" >>$(TEST_PGDATA_DATA)/postgresql.conf
+ pg_ctl -w start
+ createdb
+ psql -Xc 'alter database template0 allow_connections on'
+ env time ./trunc-clog-concurrency-tester
+
$(call recurse,check-world,src/test src/pl src/interfaces/ecpg contrib src/bin,check)
$(call recurse,checkprep, src/test src/pl src/interfaces/ecpg contrib src/bin)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index aa089d8..7202f34 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pgstat.h"
#include "pg_trace.h"
#include "storage/proc.h"
+#include "utils/fmgrprotos.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -917,6 +918,22 @@ TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid)
if (!SlruScanDirectory(ClogCtl, SlruScanDirCbReportPresence, &cutoffPage))
return; /* nothing to remove */
+#if 0
+ /* FIXME Move sleep duration into a GUC? */
+ if (LWLockConditionalAcquire(TruncSleepLock, LW_EXCLUSIVE))
+ {
+ elog(LOG, "TruncSleepLock taken: sleeping (%d for %u)",
+ cutoffPage, oldestXact);
+ DirectFunctionCall1(pg_sleep, Float8GetDatum(10.0));
+ /* TODO increase time, attach debugger and check caller vars */
+ LWLockRelease(TruncSleepLock);
+ elog(LOG, "TruncSleepLock done: proceeding");
+ }
+ else
+ elog(LOG, "TruncSleepLock unavailable: proceeding (%d for %u)",
+ cutoffPage, oldestXact);
+#endif
+
/*
* Advance oldestClogXid before truncating clog, so concurrent xact status
* lookups can ensure they don't attempt to access truncated-away clog.
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352..ab28eda 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -57,6 +57,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
#include "miscadmin.h"
@@ -1171,11 +1172,6 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
int slotno;
/*
- * The cutoff point is the start of the segment containing cutoffPage.
- */
- cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
-
- /*
* Scan shared memory and remove any pages preceding the cutoff page, to
* ensure we won't rewrite them later. (Since this is normally called in
* or just after a checkpoint, any dirty pages should have been flushed
@@ -1191,6 +1187,21 @@ restart:;
* have already wrapped around, and proceeding with the truncation would
* risk removing the current segment.
*/
+ if (shared->ControlLock == CLogControlLock)
+ {
+ int test = shared->latest_page_number;
+ elog(WARNING, "important safety check: %d latest < %d cutoff?",
+ shared->latest_page_number, cutoffPage);
+ while (test < 130000)
+ {
+ if (ctl->PagePrecedes(test, cutoffPage))
+ {
+ elog(WARNING, "safety check would trip at %d", test);
+ break;
+ }
+ test++;
+ }
+ }
if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
{
LWLockRelease(shared->ControlLock);
@@ -1233,6 +1244,29 @@ restart:;
LWLockRelease(shared->ControlLock);
+#if 1
+ if (shared->ControlLock == CLogControlLock)
+ {
+ /* FIXME Move sleep duration into a GUC? */
+ if (LWLockConditionalAcquire(TruncSleepLock, LW_EXCLUSIVE))
+ {
+ elog(LOG, "TruncSleepLock taken: sleeping (%d)",
+ cutoffPage);
+ DirectFunctionCall1(pg_sleep, Float8GetDatum(10.0));
+ /* TODO increase time, attach debugger and check caller vars */
+ LWLockRelease(TruncSleepLock);
+ elog(LOG, "TruncSleepLock done: proceeding");
+ }
+ else
+ elog(LOG, "TruncSleepLock unavailable: proceeding (%d)",
+ cutoffPage);
+ }
+#endif
+
+ if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ ereport(LOG,
+ (errmsg("too late, but apparent wraparound")));
+
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
}
@@ -1320,11 +1354,10 @@ restart:
bool
SlruScanDirCbReportPresence(SlruCtl ctl, char *filename, int segpage, void *data)
{
+ int seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;
int cutoffPage = *(int *) data;
- cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
-
- if (ctl->PagePrecedes(segpage, cutoffPage))
+ if (ctl->PagePrecedes(seg_last_page, cutoffPage))
return true; /* found one; don't iterate any more */
return false; /* keep going */
@@ -1337,9 +1370,10 @@ SlruScanDirCbReportPresence(SlruCtl ctl, char *filename, int segpage, void *data
static bool
SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
{
+ int seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;
int cutoffPage = *(int *) data;
- if (ctl->PagePrecedes(segpage, cutoffPage))
+ if (ctl->PagePrecedes(seg_last_page, cutoffPage))
SlruInternalDeleteSegment(ctl, filename);
return false; /* keep going */
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fe94fda..37db959 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -33,6 +33,8 @@
/* pointer to "variable cache" in shared memory (set up by shmem.c) */
VariableCache ShmemVariableCache = NULL;
+int JJ_xid=0;
+
/*
* Allocate the next XID for a new transaction or subtransaction.
@@ -168,6 +170,11 @@ GetNewTransactionId(bool isSubXact)
*
* Extend pg_subtrans and pg_commit_ts too.
*/
+ {
+ int incr;
+ for (incr=0; incr <=JJ_xid; incr++)
+ {
+ xid = ShmemVariableCache->nextXid;
ExtendCLOG(xid);
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
@@ -180,6 +187,13 @@ GetNewTransactionId(bool isSubXact)
*/
TransactionIdAdvance(ShmemVariableCache->nextXid);
+ /* If JJ_xid opposes xidStopLimit, the latter wins */
+ if (TransactionIdFollowsOrEquals(ShmemVariableCache->nextXid,
+ ShmemVariableCache->xidStopLimit))
+ break;
+ }
+ }
+
/*
* We must store the new XID into the shared ProcArray before releasing
* XidGenLock. This ensures that every active XID older than
@@ -302,9 +316,7 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
* We'll refuse to continue assigning XIDs in interactive mode once we get
* within 1M transactions of data loss. This leaves lots of room for the
* DBA to fool around fixing things in a standalone backend, while not
- * being significant compared to total XID space. (Note that since
- * vacuuming requires one transaction per table cleaned, we had better be
- * sure there's lots of XIDs left...)
+ * being significant compared to total XID space.
*/
xidStopLimit = xidWrapLimit - 1000000;
if (xidStopLimit < FirstNormalTransactionId)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a707d4d..fa7ab6a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5143,7 +5143,7 @@ sigusr1_handler(SIGNAL_ARGS)
* that by launching another iteration as soon as the current one
* completes.
*/
- start_autovac_launcher = true;
+ /* start_autovac_launcher = true; */
}
if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843..335521c 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
CLogTruncationLock 44
+TruncSleepLock 45
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f81e042..4e59a68 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -116,6 +116,7 @@
/* XXX these should appear in other modules' header files */
extern bool Log_disconnections;
extern int CommitDelay;
+extern int JJ_xid;
extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
@@ -2617,6 +2618,15 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"JJ_xid", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Skip this many xid every time we acquire one"),
+ NULL
+ },
+ &JJ_xid,
+ 0, 0, 1000000, NULL, NULL
+ },
+
+ {
{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
gettext_noop("Sets the minimum concurrent open transactions before performing "
"commit_delay."),
diff --git a/trunc-clog-concurrency.c b/trunc-clog-concurrency.c
new file mode 100644
index 0000000..2c35dd9
--- /dev/null
+++ b/trunc-clog-concurrency.c
@@ -0,0 +1,178 @@
+#include <libpq-fe.h>
+
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+static void
+report_query_failure(const char *query, PGresult *res)
+{
+ fprintf(stderr, "query \"%s\" failed unexpectedly: %s",
+ query, PQresultErrorMessage(res));
+}
+
+static void
+safe_query(PGconn *conn, const char *query)
+{
+ PGresult *res;
+
+ res = PQexec(conn, query);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK &&
+ PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ report_query_failure(query, res);
+ exit(1);
+ }
+ PQclear(res);
+}
+
+static int
+is_stop_limit(PGresult *res)
+{
+ return PQresultStatus(res) == PGRES_FATAL_ERROR
+ && strcmp(PQresultErrorField(res, PG_DIAG_SQLSTATE), "54000") == 0;
+}
+
+int
+main(int argc, char **argv)
+{
+ bool reached_stop_limit = false;
+ PGconn *mutate_conn, *hold1_conn, *hold2_conn, *burn_conn;
+ PGresult *res;
+ int n_burns = 0, n_inserts = 0;
+
+ if (argc != 1)
+ {
+ fputs("Usage: trunc-clog-concurrency\n", stderr);
+ return 1;
+ }
+
+ mutate_conn = PQconnectdb("");
+ if (PQstatus(mutate_conn) != CONNECTION_OK)
+ {
+ fprintf(stderr, "PGconnectdb failed: %s", PQerrorMessage(mutate_conn));
+ return 1;
+ }
+
+ hold1_conn = PQconnectdb("");
+ if (PQstatus(hold1_conn) != CONNECTION_OK)
+ {
+ fprintf(stderr, "PGconnectdb failed: %s", PQerrorMessage(hold1_conn));
+ return 1;
+ }
+
+ hold2_conn = PQconnectdb("");
+ if (PQstatus(hold2_conn) != CONNECTION_OK)
+ {
+ fprintf(stderr, "PGconnectdb failed: %s", PQerrorMessage(hold2_conn));
+ return 1;
+ }
+
+ burn_conn = PQconnectdb("options=--JJ_xid=1000000");
+ if (PQstatus(burn_conn) != CONNECTION_OK)
+ {
+ fprintf(stderr, "PGconnectdb failed: %s", PQerrorMessage(burn_conn));
+ return 1;
+ }
+
+ /* Start a transaction having an xid. */
+ safe_query(mutate_conn, "BEGIN;");
+ safe_query(mutate_conn, "DROP TABLE IF EXISTS trunc_clog_concurrency;");
+ safe_query(mutate_conn, "CREATE TABLE trunc_clog_concurrency ();");
+
+ /* Burn the entire XID space. */
+ while (!reached_stop_limit)
+ {
+ const char query[] = "SELECT txid_current();";
+ res = PQexec(burn_conn, query);
+ if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ {
+ ++n_burns;
+ if (n_burns == 2)
+ {
+ safe_query(hold1_conn, "BEGIN ISOLATION LEVEL READ COMMITTED;");
+ safe_query(hold1_conn, "CREATE TABLE trunc_clog_concurrency_hold1 ();");
+ }
+ if (n_burns == 10)
+ {
+ safe_query(hold2_conn, "BEGIN ISOLATION LEVEL READ COMMITTED;");
+ safe_query(hold2_conn, "CREATE TABLE trunc_clog_concurrency_hold2 ();");
+ system("psql -Xc 'select state, backend_xid, backend_xmin, query from pg_stat_activity'");
+ system("psql -Xc 'select datname,datallowconn,datfrozenxid from pg_database'");
+ }
+ /* keep burning */;
+ }
+ else if (is_stop_limit(res))
+ {
+ reached_stop_limit = true;
+ fprintf(stderr, "reached stop limit: %s",
+ PQresultErrorMessage(res));
+ }
+ else
+ {
+ reached_stop_limit = true; /* FIXME not really */
+ report_query_failure(query, res);
+ }
+ PQclear(res);
+ }
+
+ /* Finish the first transaction. xmin raises from start to start+2M. */
+ safe_query(mutate_conn, "COMMIT;");
+
+ /* Raise datfrozenxid of all but template1 to start+2M. No truncation. */
+ system("for d in postgres template0 test; do vacuumdb -F $d; done; "
+ "echo -n 'DONE1 '; date");
+ /* Raise xmin to start+10M */
+ safe_query(hold1_conn, "COMMIT;");
+ /* Sleep on lock before truncating to start+2M. */
+ system("(vacuumdb -F template1; echo -n 'DONE2 '; date) &");
+ usleep(4000*1000); /* 4s */
+
+ /* Truncate to start+10M. */
+ system("(vacuumdb -aF; echo -n 'DONE3 '; date)");
+ system("psql -Xc 'select state, backend_xid, backend_xmin, query from pg_stat_activity'");
+ system("psql -Xc 'select datname,datallowconn,datfrozenxid from pg_database'");
+
+ /*
+ * We want to burn at least 1M xids (the amount protected by xidStopLimit)
+ * but not more than 200M (autovacuum_freeze_max_age default) to avoid a
+ * second set of VACUUMs.
+ */
+ while (n_inserts < 150)
+ {
+ const char query[] =
+ "INSERT INTO trunc_clog_concurrency DEFAULT VALUES";
+ res = PQexec(burn_conn, query);
+ if (PQresultStatus(res) == PGRES_COMMAND_OK)
+ {
+ n_inserts++;
+ fprintf(stderr, "insert %d ", n_inserts);
+ system("date >&2");
+ }
+ else if (is_stop_limit(res))
+ {
+ fprintf(stderr, "reached stop limit: %s",
+ PQresultErrorMessage(res));
+ break;
+ }
+ else
+ {
+ report_query_failure(query, res);
+ return 1;
+ }
+ PQclear(res);
+ }
+
+ system("psql -Xc 'select state, backend_xid, backend_xmin, query from pg_stat_activity'");
+ system("psql -Xc 'select datname,datallowconn,datfrozenxid from pg_database'");
+
+ PQfinish(mutate_conn);
+ PQfinish(hold1_conn);
+ PQfinish(hold2_conn);
+ PQfinish(burn_conn);
+
+ return 0;
+}
On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 1 unlinks segment ABCD
vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
vac_truncate_clog() instance 1 finishes
some backend calls SimpleLruZeroPage(), creating segment ABCD
vac_truncate_clog() instance 2 unlinks segment ABCD
Serializing vac_truncate_clog() fixes that.
I've wondered before (in a -bugs thread[1] about unexplained pg_serial
wraparound warnings) if we could map 64 bit xids to wide SLRU file
names that never wrap around and make this class of problem go away.
Unfortunately multixacts would need 64 bit support too...
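For example, file names could encode a 64 bit segment number directly, so that
they sort correctly and never wrap (a sketch only, with invented width and
function name):

	static void
	SlruFileName64(SlruCtl ctl, char *path, uint64 segno)
	{
		snprintf(path, MAXPGPATH, "%s/%015llX", ctl->Dir,
				 (unsigned long long) segno);
	}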
[1]: /messages/by-id/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V+o_3BZ=buVLQBtRg@mail.gmail.com
On Mon, Nov 04, 2019 at 03:26:35PM +1300, Thomas Munro wrote:
On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 1 unlinks segment ABCD
vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
vac_truncate_clog() instance 1 finishes
some backend calls SimpleLruZeroPage(), creating segment ABCD
vac_truncate_clog() instance 2 unlinks segment ABCD
Serializing vac_truncate_clog() fixes that.
I've wondered before (in a -bugs thread[1] about unexplained pg_serial
wraparound warnings) if we could map 64 bit xids to wide SLRU file
names that never wrap around and make this class of problem go away.
Unfortunately multixacts would need 64 bit support too...
[1] /messages/by-id/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V+o_3BZ=buVLQBtRg@mail.gmail.com
That change sounds good to me.
On Mon, Nov 04, 2019 at 03:43:09PM -0800, Noah Misch wrote:
On Mon, Nov 04, 2019 at 03:26:35PM +1300, Thomas Munro wrote:
On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 1 unlinks segment ABCD
vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
vac_truncate_clog() instance 1 finishes
some backend calls SimpleLruZeroPage(), creating segment ABCD
vac_truncate_clog() instance 2 unlinks segment ABCD
Serializing vac_truncate_clog() fixes that.
I've wondered before (in a -bugs thread[1] about unexplained pg_serial
wraparound warnings) if we could map 64 bit xids to wide SLRU file
names that never wrap around and make this class of problem go away.
Unfortunately multixacts would need 64 bit support too...
[1] /messages/by-id/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V+o_3BZ=buVLQBtRg@mail.gmail.com
That change sounds good to me.
I do think $SUBJECT should move forward independent of that.
On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:
Hi,
Also, if I understand the data-loss hazard properly, it's what you
said in the other thread: the latest_page_number could advance after
we make our decision about what to truncate, and then maybe we could
truncate newly-added data. We surely don't want to lock out the
operations that can advance latest_page_number, so how does serializing
vac_truncate_clog help?
I wonder whether what we need to do is add some state in shared
memory saying "it is only safe to create pages before here", and
make SimpleLruZeroPage or its callers check that. The truncation
logic would advance that value, but only after completing a scan.
(Oh ..., hmm, maybe the point of serializing truncations is to
ensure that we know that we can safely advance that?)
vac_truncate_clog() already records "it is only safe to create pages before
here" in ShmemVariableCache->xidWrapLimit, which it updates after any unlinks.
The trouble comes when two vac_truncate_clog() run in parallel and you get a
sequence of events like this:
vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 1 unlinks segment ABCD
vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
vac_truncate_clog() instance 1 finishes
some backend calls SimpleLruZeroPage(), creating segment ABCD
vac_truncate_clog() instance 2 unlinks segment ABCD
Serializing vac_truncate_clog() fixes that.
I'm probably missing something, so just wanted to clarify. Do I
understand correctly that thread [1] and this one are independent, and
it is assumed in the scenario above that we're at the end of XID space,
but not necessarily having rounding issues? I'm a bit confused, since
the repro script does something about cutoffPage, and I'm not sure
if it's important or not.
Can you post whatever script you used to detect/reproduce the problem
in the first place?
I've attached it, but it's pretty hackish. Apply the patch on commit 7170268,
make sure your --bindir is in $PATH, copy "conf-test-pg" to your home
directory, and run "make trunc_check". This incorporates xid-burn
acceleration code that Jeff Janes shared with -hackers at some point.
I've tried to use it to understand the problem better, but somehow
couldn't reproduce it via the suggested script. I've applied it to 7170268
(tried also apply rebased to the master with the same results) with the
conf-test-pg in place, but after going through all steps there are no
errors like:
ERROR: could not access status of transaction 2149484247
DETAIL: Could not open file "pg_xact/0801": No such file or directory.
Is there anything extra one needs to do to reproduce the problem, maybe
adjust delays or something?
[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com
On Sun, Nov 17, 2019 at 12:56:52PM +0100, Dmitry Dolgov wrote:
On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:
Also, if I understand the data-loss hazard properly, it's what you
said in the other thread: the latest_page_number could advance after
we make our decision about what to truncate, and then maybe we could
truncate newly-added data. We surely don't want to lock out the
operations that can advance latest_page_number, so how does serializing
vac_truncate_clog help?
I wonder whether what we need to do is add some state in shared
memory saying "it is only safe to create pages before here", and
make SimpleLruZeroPage or its callers check that. The truncation
logic would advance that value, but only after completing a scan.
(Oh ..., hmm, maybe the point of serializing truncations is to
ensure that we know that we can safely advance that?)
vac_truncate_clog() already records "it is only safe to create pages before
here" in ShmemVariableCache->xidWrapLimit, which it updates after any unlinks.
The trouble comes when two vac_truncate_clog() run in parallel and you get a
sequence of events like this:
vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
vac_truncate_clog() instance 1 unlinks segment ABCD
vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
vac_truncate_clog() instance 1 finishes
some backend calls SimpleLruZeroPage(), creating segment ABCD
vac_truncate_clog() instance 2 unlinks segment ABCD
Serializing vac_truncate_clog() fixes that.
I'm probably missing something, so just wanted to clarify. Do I
understand correctly, that thread [1] and this one are independent, and
it is assumed in the scenario above that we're at the end of XID space,
but not necessarily having rounding issues? I'm a bit confused, since
the reproduction script does something with cutoffPage, and I'm not sure
if it's important or not.
I think the repro recipe contained an early fix for the other thread's bug.
While they're independent in principle, I likely never reproduced this bug
without having a fix in place for the other thread's bug. The bug in the
other thread was easier to hit.
Can you post whatever script you used to detect/reproduce the problem
in the first place?
I've attached it, but it's pretty hackish. Apply the patch on commit 7170268,
make sure your --bindir is in $PATH, copy "conf-test-pg" to your home
directory, and run "make trunc-check". This incorporates xid-burn
acceleration code that Jeff Janes shared with -hackers at some point.
I've tried to use it to understand the problem better, but somehow
couldn't reproduce it via the suggested script. I've applied the patch to 7170268
(and also tried it rebased onto master, with the same results) with
conf-test-pg in place, but after going through all the steps there are no
errors like:
That is unfortunate.
Is there anything extra one needs to do to reproduce the problem, maybe
adjust delays or something?
It wouldn't surprise me. I did all my testing on one or two systems; the
hard-coded delays suffice there, but I'm sure there exist systems needing
different delays.
Though I did reproduce this bug, I'm motivated by the abstract problem more
than any particular way to reproduce it. Commit 996d273 inspired me; by
removing a GetCurrentTransactionId(), it allowed the global xmin to advance at
times it previously could not. That subtly changed the concurrency
possibilities. I think safe, parallel SimpleLruTruncate() is difficult to
maintain and helps too rarely to justify such maintenance. That's why I
propose eliminating the concurrency.
[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com
On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
Though I did reproduce this bug, I'm motivated by the abstract problem more
than any particular way to reproduce it. Commit 996d273 inspired me; by
removing a GetCurrentTransactionId(), it allowed the global xmin to advance at
times it previously could not. That subtly changed the concurrency
possibilities. I think safe, parallel SimpleLruTruncate() is difficult to
maintain and helps too rarely to justify such maintenance. That's why I
propose eliminating the concurrency.
Sure, I see the point and the possibility for the issue itself, but of
course it's easier to reason about an issue I can reproduce :)
I wonder about performance in a database with millions of small relations,
particularly considering my intent to back-patch this. In such databases,
vac_update_datfrozenxid() can be a major part of the VACUUM's cost. Two
things work in our favor. First, vac_update_datfrozenxid() runs once per
VACUUM command, not once per relation. Second, Autovacuum has this logic:
* ... we skip
* this if (1) we found no work to do and (2) we skipped at least one
* table due to concurrent autovacuum activity. In that case, the other
* worker has already done it, or will do so when it finishes.
*/
if (did_vacuum || !found_concurrent_worker)
vac_update_datfrozenxid();
That makes me relatively unworried. I did consider some alternatives:
Btw, I've performed a few experiments with parallel vacuuming of 10^4
small tables that are receiving some small inserts; the results look like
this:
# with patch
# funclatency -u bin/postgres:vac_update_datfrozenxid
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 3 |*** |
1024 -> 2047 : 38 |****************************************|
2048 -> 4095 : 15 |*************** |
4096 -> 8191 : 15 |*************** |
8192 -> 16383 : 2 |** |
# without patch
# funclatency -u bin/postgres:vac_update_datfrozenxid
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 5 |**** |
1024 -> 2047 : 49 |****************************************|
2048 -> 4095 : 11 |******** |
4096 -> 8191 : 5 |**** |
8192 -> 16383 : 1 | |
In general it seems that latency tends to be a bit higher with the patch, but
I don't think the difference is significant.
On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
I'm probably missing something, so just wanted to clarify. Do I
understand correctly, that thread [1] and this one are independent, and
it is assumed in the scenario above that we're at the end of XID space,
but not necessarily having rounding issues? I'm a bit confused, since
the reproduction script does something with cutoffPage, and I'm not sure
if it's important or not.
I think the repro recipe contained an early fix for the other thread's bug.
While they're independent in principle, I likely never reproduced this bug
without having a fix in place for the other thread's bug. The bug in the
other thread was easier to hit.
Just to clarify, since the CF item for this patch was withdrawn
recently: does it mean that eventually the thread [1] covers this one
too?
[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com
On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
I'm probably missing something, so just wanted to clarify. Do I
understand correctly, that thread [1] and this one are independent, and
it is assumed in the scenario above that we're at the end of XID space,
but not necessarily having rounding issues? I'm a bit confused, since
the reproduction script does something with cutoffPage, and I'm not sure
if it's important or not.
I think the repro recipe contained an early fix for the other thread's bug.
While they're independent in principle, I likely never reproduced this bug
without having a fix in place for the other thread's bug. The bug in the
other thread was easier to hit.
Just to clarify, since the CF item for this patch was withdrawn
recently: does it mean that eventually the thread [1] covers this one
too?
[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com
I withdrew $SUBJECT because, if someone reviews one of my patches, I want it
to be the one you cite in [1]. I plan not to commit [1] without a Ready for
Committer, and I plan not to commit $SUBJECT before committing [1]. I would
be willing to commit $SUBJECT without getting a review, however.
On Fri, Jun 28, 2019 at 10:06:28AM -0700, Noah Misch wrote:
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
I'm forking this thread from
/messages/by-id/flat/20190202083822.GC32531@gust.leadboat.com, which
reported a race condition involving the "apparent wraparound" safety check in
SimpleLruTruncate():
On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
1. The result of the test is valid only until we release the SLRU ControlLock,
which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
segments for deletion. Once we release that lock, latest_page_number can
advance. This creates a TOCTOU race condition, allowing excess deletion:
[local] test=# table trunc_clog_concurrency ;
ERROR: could not access status of transaction 2149484247
DETAIL: Could not open file "pg_xact/0801": No such file or directory.
Fixes are available:
b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
checkpoint, or in the startup process. Hence, also arrange for only one
backend to call SimpleLruTruncate(AsyncCtl) at a time.
More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
backend per cluster. This, attached, was rather straightforward.
Rebased. The conflicts were limited to comments and documentation.
Rebased, most notably over the lock renames of commit 5da1493. In
LockTagTypeNames, I did s/frozen IDs/frozenid/ for consistency with that
commit having done s/speculative token/spectoken/.
Attachments:
wrap-limits-mutex-v3.patchtext/plain; charset=us-asciiDownload
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Prevent concurrent SimpleLruTruncate() for any given SLRU.
The SimpleLruTruncate() header comment states the new coding rule. This
closes race conditions that presented rare opportunities for data loss.
Symptoms and recommendations to users mirror the previous commit[1].
Back-patch to 9.5 (all supported versions).
Reviewed by FIXME.
Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
[1] (footnote to be removed before commit) refers to a patch of thread
https://postgr.es/m/flat/20190217040913.GA1305059@rfd.leadboat.com
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb1..1125c3c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1717,6 +1717,12 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to extend a relation.</entry>
</row>
<row>
+ <entry><literal>frozenid</literal></entry>
+ <entry>Waiting to
+ update <structname>pg_database</structname>.<structfield>datfrozenxid</structfield>
+ and <structname>pg_database</structname>.<structfield>datminmxid</structfield>.</entry>
+ </row>
+ <row>
<entry><literal>object</literal></entry>
<entry>Waiting to acquire a lock on a non-relation database object.</entry>
</row>
@@ -1885,6 +1891,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to read or update <command>NOTIFY</command> messages.</entry>
</row>
<row>
+ <entry><literal>NotifyQueueTail</literal></entry>
+ <entry>Waiting to update limit on <command>NOTIFY</command> message
+ storage.</entry>
+ </row>
+ <row>
<entry><literal>NotifySLRU</literal></entry>
<entry>Waiting to access the <command>NOTIFY</command> message SLRU
cache.</entry>
@@ -2061,6 +2072,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for WAL buffers to be written to disk.</entry>
</row>
<row>
+ <entry><literal>WrapLimitsVacuum</literal></entry>
+ <entry>Waiting to update limits on transaction id and multixact
+ consumption.</entry>
+ </row>
+ <row>
<entry><literal>XactBuffer</literal></entry>
<entry>Waiting for I/O on a transaction status SLRU buffer.</entry>
</row>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 61249f4..2750580 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1208,6 +1208,14 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
/*
* Remove all segments before the one holding the passed page number
+ *
+ * All SLRUs prevent concurrent calls to this function, either with an LWLock
+ * or by calling it only as part of a checkpoint. Mutual exclusion must begin
+ * before computing cutoffPage. Mutual exclusion must end after any limit
+ * update that would permit other backends to write fresh data into the
+ * segment immediately preceding the one containing cutoffPage. Otherwise,
+ * when the SLRU is quite full, SimpleLruTruncate() might delete that segment
+ * after it has accrued freshly-written data.
*/
void
SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f33ae40..d9a31d3 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -349,8 +349,8 @@ ExtendSUBTRANS(TransactionId newestXact)
/*
* Remove all SUBTRANS segments before the one holding the passed transaction ID
*
- * This is normally called during checkpoint, with oldestXact being the
- * oldest TransactionXmin of any running transaction.
+ * oldestXact is the oldest TransactionXmin of any running transaction. This
+ * is called only during checkpoint.
*/
void
TruncateSUBTRANS(TransactionId oldestXact)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index a3ba88d..4e0db8e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -244,19 +244,22 @@ typedef struct QueueBackendStatus
/*
* Shared memory state for LISTEN/NOTIFY (excluding its SLRU stuff)
*
- * The AsyncQueueControl structure is protected by the NotifyQueueLock.
+ * The AsyncQueueControl structure is protected by the NotifyQueueLock and
+ * NotifyQueueTailLock.
*
- * When holding the lock in SHARED mode, backends may only inspect their own
- * entries as well as the head and tail pointers. Consequently we can allow a
- * backend to update its own record while holding only SHARED lock (since no
- * other backend will inspect it).
+ * When holding NotifyQueueLock in SHARED mode, backends may only inspect
+ * their own entries as well as the head and tail pointers. Consequently we
+ * can allow a backend to update its own record while holding only SHARED lock
+ * (since no other backend will inspect it).
*
- * When holding the lock in EXCLUSIVE mode, backends can inspect the entries
- * of other backends and also change the head and tail pointers.
+ * When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
+ * entries of other backends and also change the head pointer. When holding
+ * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
+ * can change the tail pointer.
*
* NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
- * In order to avoid deadlocks, whenever we need both locks, we always first
- * get NotifyQueueLock and then NotifySLRULock.
+ * In order to avoid deadlocks, whenever we need multiple locks, we first get
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -2177,6 +2180,10 @@ asyncQueueAdvanceTail(void)
int newtailpage;
int boundary;
+ /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+ LWLockAcquire(NotifyQueueTailLock, LW_EXCLUSIVE);
+
+ /* Compute the new tail. */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
min = QUEUE_HEAD;
for (BackendId i = QUEUE_FIRST_LISTENER; i > 0; i = QUEUE_NEXT_LISTENER(i))
@@ -2185,7 +2192,6 @@ asyncQueueAdvanceTail(void)
min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
}
oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
- QUEUE_TAIL = min;
LWLockRelease(NotifyQueueLock);
/*
@@ -2205,6 +2211,17 @@ asyncQueueAdvanceTail(void)
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
}
+
+ /*
+ * Advertise the new tail. This changes asyncQueueIsFull()'s verdict for
+ * the segment immediately prior to the new tail, allowing fresh data into
+ * that segment.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ QUEUE_TAIL = min;
+ LWLockRelease(NotifyQueueLock);
+
+ LWLockRelease(NotifyQueueTailLock);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110ed..2911761 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1345,6 +1345,14 @@ vac_update_datfrozenxid(void)
bool dirty = false;
/*
+ * Restrict this task to one backend per database. This avoids race
+ * conditions that would move datfrozenxid or datminmxid backward or call
+ * vac_truncate_clog() with a datfrozenxid preceding a datfrozenxid passed
+ * to an earlier vac_truncate_clog() call.
+ */
+ LockDatabaseFrozenIds(ExclusiveLock);
+
+ /*
* Initialize the "min" calculation with GetOldestXmin, which is a
* reasonable approximation to the minimum relfrozenxid for not-yet-
* committed pg_class entries for new tables; see AddNewRelationTuple().
@@ -1533,6 +1541,9 @@ vac_truncate_clog(TransactionId frozenXID,
bool bogus = false;
bool frozenAlreadyWrapped = false;
+ /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+ LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);
+
/* init oldest datoids to sync with my frozenXID/minMulti values */
oldestxid_datoid = MyDatabaseId;
minmulti_datoid = MyDatabaseId;
@@ -1642,6 +1653,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
SetTransactionIdLimit(frozenXID, oldestxid_datoid);
SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+ LWLockRelease(WrapLimitsVacuumLock);
}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2010320..7409de9 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -461,6 +461,21 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
}
/*
+ * LockDatabaseFrozenIds
+ *
+ * This allows one backend per database to execute vac_update_datfrozenxid().
+ */
+void
+LockDatabaseFrozenIds(LOCKMODE lockmode)
+{
+ LOCKTAG tag;
+
+ SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
+
+ (void) LockAcquire(&tag, lockmode, false, false);
+}
+
+/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -1098,6 +1113,11 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
+ case LOCKTAG_DATABASE_FROZEN_IDS:
+ appendStringInfo(buf,
+ _("pg_database.datfrozenxid of database %u"),
+ tag->locktag_field1);
+ break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6985e8..a1e62de 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,5 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
XactTruncationLock 44
+WrapLimitsVacuumLock 45
+NotifyQueueTailLock 46
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index e992d1b..f592292 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -29,6 +29,7 @@
const char *const LockTagTypeNames[] = {
"relation",
"extend",
+ "frozenid",
"page",
"tuple",
"transactionid",
@@ -254,6 +255,17 @@ pg_lock_status(PG_FUNCTION_ARGS)
nulls[8] = true;
nulls[9] = true;
break;
+ case LOCKTAG_DATABASE_FROZEN_IDS:
+ values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ break;
case LOCKTAG_PAGE:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 3acc11a..f7cabcb 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -59,6 +59,9 @@ extern bool ConditionalLockRelationForExtension(Relation relation,
LOCKMODE lockmode);
extern int RelationExtensionLockWaiterCount(Relation relation);
+/* Lock to recompute pg_database.datfrozenxid in the current database */
+extern void LockDatabaseFrozenIds(LOCKMODE lockmode);
+
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fdabf42..1c3e9c1 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,6 +138,7 @@ typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
+ LOCKTAG_DATABASE_FROZEN_IDS, /* pg_database.datfrozenxid */
LOCKTAG_PAGE, /* one page of a relation */
LOCKTAG_TUPLE, /* one physical tuple */
LOCKTAG_TRANSACTION, /* transaction (for waiting for xact done) */
@@ -194,6 +195,15 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+/* ID info for frozen IDs is DB OID */
+#define SET_LOCKTAG_DATABASE_FROZEN_IDS(locktag,dboid) \
+ ((locktag).locktag_field1 = (dboid), \
+ (locktag).locktag_field2 = 0, \
+ (locktag).locktag_field3 = 0, \
+ (locktag).locktag_field4 = 0, \
+ (locktag).locktag_type = LOCKTAG_DATABASE_FROZEN_IDS, \
+ (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+
/* ID info for a page is RELATION info + BlockNumber */
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
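The header comment added to SimpleLruTruncate() above implies one shape for
every caller; a hypothetical sketch (MySlruTruncateLock, MySlruCtl,
ComputeCutoffPage(), and AdvanceWriteLimit() are illustrative names, not from
the patch):
	/* Hypothetical caller following the new coding rule. */
	LWLockAcquire(MySlruTruncateLock, LW_EXCLUSIVE);
	cutoffPage = ComputeCutoffPage();	/* decide under the mutex */
	SimpleLruTruncate(MySlruCtl, cutoffPage);
	AdvanceWriteLimit();	/* mutex must cover the limit update, too */
	LWLockRelease(MySlruTruncateLock);
Both call sites the patch changes fit this shape: vac_truncate_clog() uses
WrapLimitsVacuumLock and ends with SetTransactionIdLimit() and
SetMultiXactIdLimit(), while asyncQueueAdvanceTail() uses NotifyQueueTailLock
and ends by advertising the new QUEUE_TAIL. Checkpoint-time truncations such
as TruncateSUBTRANS() instead rely on running only in the checkpointer.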
On Fri, Jan 31, 2020 at 12:42:13AM -0800, Noah Misch wrote:
On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
Just to clarify, since the CF item for this patch was withdrawn
recently. Does it mean that eventually the thread [1] covers this one
too?
[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com
I withdrew $SUBJECT because, if someone reviews one of my patches, I want it
to be the one you cite in [1]. I plan not to commit [1] without a Ready for
Committer, and I plan not to commit $SUBJECT before committing [1]. I would
be willing to commit $SUBJECT without getting a review, however.
After further reflection, I plan to push $SUBJECT shortly after 2020-08-13,
not waiting for [1] to change status. Reasons:
- While I put significant weight on the fact that I couldn't reproduce
$SUBJECT problems without first fixing [1], I'm no longer confident of that
representing real-world experiences. Reproducing [1] absolutely requires a
close approach to a wrap limit, but $SUBJECT problems might not.
- Adding locks won't change correct functional behavior to incorrect
functional behavior.
- By pushing at that particular time, the fix ordinarily will appear in v13.0
before appearing in a back branch release. If problematic contention arises
quickly in the field, that's a more comfortable way to discover it.