autovacuum truncate exclusive lock round two

Started by Jan Wieck · about 13 years ago · 42 messages
#1Jan Wieck
JanWieck@Yahoo.com
1 attachment(s)

This problem has been discussed before. Those familiar with the subject
please skip the next paragraph.

When autovacuum finds a substantial number of empty pages at the end of
a relation, it attempts to truncate it in lazy_truncate_heap(). Because
all the scanning had been done in parallel with normal DB activity, it
needs to verify that all those blocks are still empty. To do that,
autovacuum grabs an AccessExclusiveLock on the relation, then scans
backwards to the last non-empty page. If any other backend needs to
access that table during this time, the deadlock detection code will
kill the autovacuum, which by default happens after a 1000 ms timeout.
The autovacuum launcher will start another vacuum after (by default) 60
seconds, which most likely gets killed again, and again, and again. The
net result is that the table is never truncated, and every 60 seconds
there is a one-second hiccup before the autovacuum is killed.

Proposal:

Add functions to lmgr that are derived from the lock release code, but
instead of releasing the lock and waking up waiters, they just return a
boolean telling whether there are any waiters that would be woken up if
this lock were released.

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c
to periodically check whether there is a conflicting lock request
waiting. If not, keep going. If there is a waiter, truncate the relation
to the point checked thus far, release the AccessExclusiveLock, then
loop back to where we acquire this lock in the first place and continue
checking/truncating.
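
In rough pseudo-C, the check inside the backward scan would look
something like this (a simplified sketch only; LockHasWaitersRelation()
is the proposed lmgr function, the rest of the names are illustrative
and the real code is in the attached patch):

    while (blkno > nonempty_pages)
    {
        CHECK_FOR_INTERRUPTS();

        /* New lmgr call: is anybody queued behind our lock? */
        if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
        {
            /*
             * Truncate to the point verified so far, release the
             * AccessExclusiveLock, then loop back and try to
             * re-acquire it and resume.
             */
            lock_waiter_detected = true;
            break;
        }

        blkno--;

        /* ... read the page and check whether it contains tuples ... */
    }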

I have a working patch here:

https://github.com/wieck/postgres/tree/autovacuum-truncate-lock

This patch does introduce three new postgresql.conf parameters, which I
would be happy to get rid of if we could derive them from something
else. Something based on the deadlock timeout may be possible.

autovacuum_truncate_lock_check = 100ms # how frequent to check
# for conflicting locks
autovacuum_truncate_lock_retry = 50 # how often to try acquiring
# the exclusive lock
autovacuum_truncate_lock_wait = 20ms # nap in between attempts

With these settings, I see the truncate of a bloated table progressing
at a rate of 3 minutes per GB, while that table is accessed 20 times per
second.

The original "kill autovacuum" mechanism in the deadlock code is still
there. All this code really does is 10 lmgr lookups per second and
releasing the AccessExclusiveLock if there are any waiters. I don't
think it can get any cheaper than this.

I am attaching a script that uses pgbench to demonstrate the actual
problem of a bloated table with a significant number of empty pages at
the end.

Comments?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachments:

t1.autovac-lock-issue.tgz (application/x-compressed)
#2Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#1)
1 attachment(s)
Re: autovacuum truncate exclusive lock round two

Here is the patch for it.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachments:

autovacuum-truncate-lock.diff (text/x-patch)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c9253a9..9f880f0 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 57,62 ****
--- 57,63 ----
  #include "utils/pg_rusage.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "portability/instr_time.h"
  
  
  /*
*************** typedef struct LVRelStats
*** 103,108 ****
--- 104,110 ----
  	ItemPointer dead_tuples;	/* array of ItemPointerData */
  	int			num_index_scans;
  	TransactionId latestRemovedXid;
+ 	bool		lock_waiter_detected;
  } LVRelStats;
  
  
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 193,198 ****
--- 195,202 ----
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
  	vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples;
  	vacrelstats->num_index_scans = 0;
+ 	vacrelstats->pages_removed = 0;
+ 	vacrelstats->lock_waiter_detected = false;
  
  	/* Open all indexes of the relation */
  	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 259,268 ****
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/* report results to the stats collector, too */
! 	pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
--- 263,280 ----
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/*
! 	 * report results to the stats collector, too.
! 	 * An early terminated lazy_truncate_heap attempt 
! 	 * suppresses the message and also cancels the
! 	 * execution of ANALYZE, if that was ordered.
! 	 */
! 	if (!vacrelstats->lock_waiter_detected)
! 		pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
+ 	else
+ 		vacstmt->options &= ~VACOPT_ANALYZE;
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
*************** lazy_truncate_heap(Relation onerel, LVRe
*** 1255,1334 ****
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * We need full exclusive lock on the relation in order to do truncation.
! 	 * If we can't get it, give up rather than waiting --- we don't want to
! 	 * block other backends, and we don't want to deadlock (which is quite
! 	 * possible considering we already hold a lower-grade lock).
! 	 */
! 	if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
! 		return;
! 
! 	/*
! 	 * Now that we have exclusive lock, look to see if the rel has grown
! 	 * whilst we were vacuuming with non-exclusive lock.  If so, give up; the
! 	 * newly added pages presumably contain non-deletable tuples.
  	 */
! 	new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 	if (new_rel_pages != old_rel_pages)
  	{
  		/*
! 		 * Note: we intentionally don't update vacrelstats->rel_pages with the
! 		 * new rel size here.  If we did, it would amount to assuming that the
! 		 * new pages are empty, which is unlikely.	Leaving the numbers alone
! 		 * amounts to assuming that the new pages have the same tuple density
! 		 * as existing ones, which is less unlikely.
  		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Scan backwards from the end to verify that the end pages actually
! 	 * contain no tuples.  This is *necessary*, not optional, because other
! 	 * backends could have added tuples to these pages whilst we were
! 	 * vacuuming.
! 	 */
! 	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
  
! 	if (new_rel_pages >= old_rel_pages)
! 	{
! 		/* can't do anything after all */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Okay to truncate.
! 	 */
! 	RelationTruncate(onerel, new_rel_pages);
  
! 	/*
! 	 * We can release the exclusive lock as soon as we have truncated.	Other
! 	 * backends can't safely access the relation until they have processed the
! 	 * smgr invalidation that smgrtruncate sent out ... but that should happen
! 	 * as part of standard invalidation processing once they acquire lock on
! 	 * the relation.
! 	 */
! 	UnlockRelation(onerel, AccessExclusiveLock);
  
! 	/*
! 	 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 	 * also touching reltuples, since the tuple count wasn't changed by the
! 	 * truncation.
! 	 */
! 	vacrelstats->rel_pages = new_rel_pages;
! 	vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
  
! 	ereport(elevel,
! 			(errmsg("\"%s\": truncated %u to %u pages",
! 					RelationGetRelationName(onerel),
! 					old_rel_pages, new_rel_pages),
! 			 errdetail("%s.",
! 					   pg_rusage_show(&ru0))));
  }
  
  /*
--- 1267,1388 ----
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
+ 	int			lock_retry;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * Loop until no more truncating can be done.
  	 */
! 	do
  	{
  		/*
! 		 * We need full exclusive lock on the relation in order to do 
! 		 * truncation.
! 		 * If we can't get it, give up rather than waiting --- we don't want to
! 		 * block other backends, and we don't want to deadlock (which is quite
! 		 * possible considering we already hold a lower-grade lock).
  		 */
! 		vacrelstats->lock_waiter_detected = false;
! 		lock_retry = 0;
! 		while (true)
! 		{
! 			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
! 				break;
  
! 			if (autovacuum_truncate_lock_retry == 0)
! 				return;
  
! 			/*
! 			 * Check for interrupts while trying to (re-)acquire
! 			 * the exclusive lock.
! 			 */
! 			CHECK_FOR_INTERRUPTS();
  
! 			if (++lock_retry > autovacuum_truncate_lock_retry)
! 			{
! 				/*
! 				 * We failed to establish the lock in the specified
! 				 * number of retries. This means we give up truncating.
! 				 * Suppress the ANALYZE step. Doing an ANALYZE at
! 				 * this point will reset the dead_tuple_count in the
! 				 * stats collector, so we will not get called by the
! 				 * autovacuum launcher again to do the truncate.
! 				 */
! 				vacrelstats->lock_waiter_detected = true;
! 				return;
! 			}
  
! 			if (autovacuum_truncate_lock_wait > 0)
! 				pg_usleep((long)autovacuum_truncate_lock_wait);
! 		}
  
! 		/*
! 		 * Now that we have exclusive lock, look to see if the rel has grown
! 		 * whilst we were vacuuming with non-exclusive lock.  If so, give up;
! 		 * the newly added pages presumably contain non-deletable tuples.
! 		 */
! 		new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 		if (new_rel_pages != old_rel_pages)
! 		{
! 			/*
! 			 * Note: we intentionally don't update vacrelstats->rel_pages 
! 			 * with the new rel size here.  If we did, it would amount to 
! 			 * assuming that the new pages are empty, which is unlikely.
! 			 * Leaving the numbers alone amounts to assuming that the new 
! 			 * pages have the same tuple density as existing ones, which 
! 			 * is less unlikely.
! 			 */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
  
! 		/*
! 		 * Scan backwards from the end to verify that the end pages actually
! 		 * contain no tuples.  This is *necessary*, not optional, because other
! 		 * backends could have added tuples to these pages whilst we were
! 		 * vacuuming.
! 		 */
! 		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
! 
! 		if (new_rel_pages >= old_rel_pages)
! 		{
! 			/* can't do anything after all */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
! 
! 		/*
! 		 * Okay to truncate.
! 		 */
! 		RelationTruncate(onerel, new_rel_pages);
! 
! 		/*
! 		 * We can release the exclusive lock as soon as we have truncated.
! 		 * Other backends can't safely access the relation until they have 
! 		 * processed the smgr invalidation that smgrtruncate sent out ... 
! 		 * but that should happen as part of standard invalidation 
! 		 * processing once they acquire lock on the relation.
! 		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 
! 		/*
! 		 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 		 * also touching reltuples, since the tuple count wasn't changed by the
! 		 * truncation.
! 		 */
! 		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
! 		vacrelstats->rel_pages = new_rel_pages;
! 
! 		ereport(elevel,
! 				(errmsg("\"%s\": truncated %u to %u pages",
! 						RelationGetRelationName(onerel),
! 						old_rel_pages, new_rel_pages),
! 				 errdetail("%s.",
! 						   pg_rusage_show(&ru0))));
! 		old_rel_pages = new_rel_pages;
! 	} while (new_rel_pages > vacrelstats->nonempty_pages && 
! 			vacrelstats->lock_waiter_detected);
  }
  
  /*
*************** static BlockNumber
*** 1340,1345 ****
--- 1394,1406 ----
  count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  {
  	BlockNumber blkno;
+ 	instr_time	starttime;
+ 	instr_time	currenttime;
+ 	instr_time	elapsed;
+ 
+ 	/* Initialize the starttime if we check for conflicting lock requests */
+ 	if (autovacuum_truncate_lock_check > 0)
+ 		INSTR_TIME_SET_CURRENT(starttime);
  
  	/* Strange coding of loop control is needed because blkno is unsigned */
  	blkno = vacrelstats->rel_pages;
*************** count_nondeletable_pages(Relation onerel
*** 1352,1357 ****
--- 1413,1451 ----
  		bool		hastup;
  
  		/*
+ 		 * Check if another process requests a lock on our relation.
+ 		 * We are holding an AccessExclusiveLock here, so they will
+ 		 * be waiting. We only do this in autovacuum_truncate_lock_check
+ 		 * millisecond intervals, and we only check if that interval
+ 		 * has elapsed once every 32 blocks to keep the number of
+ 		 * system calls and actual shared lock table lookups to a
+ 		 * minimum.
+ 		 */
+ 		if (autovacuum_truncate_lock_check > 0 && (blkno % 32) == 0)
+ 		{
+ 			INSTR_TIME_SET_CURRENT(currenttime);
+ 			INSTR_TIME_SET_ZERO(elapsed);
+ 			INSTR_TIME_ADD(elapsed, currenttime);
+ 			INSTR_TIME_SUBTRACT(elapsed, starttime);
+ 			if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000) 
+ 					>= autovacuum_truncate_lock_check)
+ 			{
+ 				if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
+ 				{
+ 					ereport(elevel,
+ 							(errmsg("\"%s\": terminating truncate "
+ 									"due to conflicting lock request",
+ 									RelationGetRelationName(onerel))));
+ 
+ 					vacrelstats->lock_waiter_detected = true;
+ 					return blkno;
+ 				}
+ 				INSTR_TIME_SET_ZERO(starttime);
+ 				INSTR_TIME_ADD(starttime, currenttime);
+ 			}
+ 		}
+ 
+ 		/*
  		 * We don't insert a vacuum delay point here, because we have an
  		 * exclusive lock on the table which we want to hold for as short a
  		 * time as possible.  We still need to check for interrupts however.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6977bcf..b8b8466 100644
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int			autovacuum_freeze_max_age;
*** 118,123 ****
--- 118,126 ----
  
  int			autovacuum_vac_cost_delay;
  int			autovacuum_vac_cost_limit;
+ int			autovacuum_truncate_lock_check;
+ int			autovacuum_truncate_lock_retry;
+ int			autovacuum_truncate_lock_wait;
  
  int			Log_autovacuum_min_duration = -1;
  
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index a7786d0..e1fa74f 100644
*** a/src/backend/storage/lmgr/lmgr.c
--- b/src/backend/storage/lmgr/lmgr.c
*************** UnlockRelation(Relation relation, LOCKMO
*** 233,238 ****
--- 233,256 ----
  }
  
  /*
+  *		LockHasWaitersRelation
+  *
+  * This is a function to check if someone else is waiting on a
+  * lock we are currently holding.
+  */
+ bool
+ LockHasWaitersRelation(Relation relation, LOCKMODE lockmode)
+ {
+ 	LOCKTAG		tag;
+ 
+ 	SET_LOCKTAG_RELATION(tag,
+ 						 relation->rd_lockInfo.lockRelId.dbId,
+ 						 relation->rd_lockInfo.lockRelId.relId);
+ 
+ 	return LockHasWaiters(&tag, lockmode, false);
+ }
+ 
+ /*
   *		LockRelationIdForSession
   *
   * This routine grabs a session-level lock on the target relation.	The
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 32cc229..605df84 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** ProcLockHashCode(const PROCLOCKTAG *proc
*** 539,544 ****
--- 539,636 ----
  	return lockhash;
  }
  
+ /*
+  * LockHasWaiters -- look up 'locktag' and check if releasing this
+  *		lock would wake up other processes waiting for it.
+  */
+ bool
+ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
+ {
+ 	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+ 	LockMethod	lockMethodTable;
+ 	LOCALLOCKTAG localtag;
+ 	LOCALLOCK  *locallock;
+ 	LOCK	   *lock;
+ 	PROCLOCK   *proclock;
+ 	LWLockId	partitionLock;
+ 	bool		hasWaiters = FALSE;
+ 
+ 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+ 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+ 	lockMethodTable = LockMethods[lockmethodid];
+ 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
+ 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
+ 
+ #ifdef LOCK_DEBUG
+ 	if (LOCK_DEBUG_ENABLED(locktag))
+ 		elog(LOG, "LockHasWaiters: lock [%u,%u] %s",
+ 			 locktag->locktag_field1, locktag->locktag_field2,
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ #endif
+ 
+ 	/*
+ 	 * Find the LOCALLOCK entry for this lock and lockmode
+ 	 */
+ 	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+ 	localtag.lock = *locktag;
+ 	localtag.mode = lockmode;
+ 
+ 	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+ 										  (void *) &localtag,
+ 										  HASH_FIND, NULL);
+ 
+ 	/*
+ 	 * let the caller print its own error message, too. Do not ereport(ERROR).
+ 	 */
+ 	if (!locallock || locallock->nLocks <= 0)
+ 	{
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		return FALSE;
+ 	}
+ 
+ 	/*
+ 	 * Check the shared lock table.
+ 	 */
+ 	partitionLock = LockHashPartitionLock(locallock->hashcode);
+ 
+ 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * We don't need to re-find the lock or proclock, since we kept their
+ 	 * addresses in the locallock table, and they couldn't have been removed
+ 	 * while we were holding a lock on them.
+ 	 */
+ 	lock = locallock->lock;
+ 	LOCK_PRINT("LockHasWaiters: found", lock, lockmode);
+ 	proclock = locallock->proclock;
+ 	PROCLOCK_PRINT("LockHasWaiters: found", proclock);
+ 
+ 	/*
+ 	 * Double-check that we are actually holding a lock of the type we want to
+ 	 * release.
+ 	 */
+ 	if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
+ 	{
+ 		PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
+ 		LWLockRelease(partitionLock);
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		RemoveLocalLock(locallock);
+ 		return FALSE;
+ 	}
+ 
+ 	/*
+ 	 * Do the checking.
+ 	 */
+ 	if ((lockMethodTable->conflictTab[lockmode] & lock->waitMask) != 0)
+ 		hasWaiters = TRUE;
+ 
+ 	LWLockRelease(partitionLock);
+ 
+ 	return hasWaiters;
+ }
+ 
  
  /*
   * LockAcquire -- Check for lock conflicts, sleep if conflict found,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 745e7be..d3fd4a3 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_int ConfigureNamesI
*** 1815,1820 ****
--- 1815,1852 ----
  	},
  
  	{
+ 		{"autovacuum_truncate_lock_check", PGC_SIGHUP, AUTOVACUUM,
+ 			gettext_noop("How often autovacuum checks for conflicting lock requests during truncate."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&autovacuum_truncate_lock_check,
+ 		100, 0, 500,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"autovacuum_truncate_lock_retry", PGC_SIGHUP, AUTOVACUUM,
+ 			gettext_noop("How often autovacuum will (re)try to acquire an exclusive lock for truncate."),
+ 			NULL
+ 		},
+ 		&autovacuum_truncate_lock_retry,
+ 		50, 0, 100,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"autovacuum_truncate_lock_wait", PGC_SIGHUP, AUTOVACUUM,
+ 			gettext_noop("How long autovacuum wait between attempts for exclusive lock for truncate."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&autovacuum_truncate_lock_wait,
+ 		20, 0, 50,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"max_files_per_process", PGC_POSTMASTER, RESOURCES_KERNEL,
  			gettext_noop("Sets the maximum number of simultaneously open files for each server process."),
  			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index eeb9b82..ec9e8c4 100644
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 471,476 ****
--- 471,481 ----
  #autovacuum_vacuum_cost_limit = -1	# default vacuum cost limit for
  					# autovacuum, -1 means use
  					# vacuum_cost_limit
+ #autovacuum_truncate_lock_check = 100ms	# default for conflicting lock check
+ 					# 0 means disabled (deadlock code will kill autovacuum),
+ #autovacuum_truncate_lock_retry = 50	# default exclusive lock attempts.
+ #autovacuum_truncate_lock_wait = 20ms	# default wait between exclusive
+ 					# lock attempts for truncate.
  
  
  #------------------------------------------------------------------------------
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index a851758..6e0e286 100644
*** a/src/include/postmaster/autovacuum.h
--- b/src/include/postmaster/autovacuum.h
*************** extern double autovacuum_anl_scale;
*** 26,31 ****
--- 26,34 ----
  extern int	autovacuum_freeze_max_age;
  extern int	autovacuum_vac_cost_delay;
  extern int	autovacuum_vac_cost_limit;
+ extern int	autovacuum_truncate_lock_check;
+ extern int	autovacuum_truncate_lock_retry;
+ extern int	autovacuum_truncate_lock_wait;
  
  /* autovacuum launcher PID, only valid when worker is shutting down */
  extern int	AutovacuumLauncherPid;
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index de340c4..aa79eda 100644
*** a/src/include/storage/lmgr.h
--- b/src/include/storage/lmgr.h
*************** extern void UnlockRelationOid(Oid relid,
*** 31,36 ****
--- 31,37 ----
  extern void LockRelation(Relation relation, LOCKMODE lockmode);
  extern bool ConditionalLockRelation(Relation relation, LOCKMODE lockmode);
  extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
  
  extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
  extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index d56f0fa..f0eca35 100644
*** a/src/include/storage/lock.h
--- b/src/include/storage/lock.h
*************** extern void LockReleaseAll(LOCKMETHODID 
*** 494,499 ****
--- 494,501 ----
  extern void LockReleaseSession(LOCKMETHODID lockmethodid);
  extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
  extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+ extern bool LockHasWaiters(const LOCKTAG *locktag,
+ 			LOCKMODE lockmode, bool sessionLock);
  extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
  				 LOCKMODE lockmode);
  extern void AtPrepare_Locks(void);
#3Stephen Frost
sfrost@snowman.net
In reply to: Jan Wieck (#1)
Re: autovacuum truncate exclusive lock round two

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

This problem has been discussed before. Those familiar with the
subject please skip the next paragraph.

Apologies if this was already thought-of and ruled out for some reason,
but...

Because all the scanning had been done in parallel to normal DB
activity, it needs to verify that all those blocks are still empty.

Would it be possible to use the FSM to figure out if things have changed
since the last scan..? Does that scan update the FSM, which would then
be updated by another backend in the event that it decided to write
something there? Or do we consider the FSM to be completely
untrustworthy wrt this (and if so, I don't suppose there's any hope to
using the visibility map...)?

The notion of having to double-scan and the AccessExclusiveLock on the
relation are telling me this work-around, while completely possible,
isn't exactly ideal...

Perhaps another option would be a page-level lock, or something which is
larger than per-row (which strikes me as a lot of overhead for this, and
it's not clear how we'd do it) but smaller than an entire relation,
though there are certainly pain points there too.

Thanks,

Stephen

#4Jan Wieck
JanWieck@Yahoo.com
In reply to: Stephen Frost (#3)
Re: autovacuum truncate exclusive lock round two

Stephen,

On 10/24/2012 10:46 PM, Stephen Frost wrote:

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

This problem has been discussed before. Those familiar with the
subject please skip the next paragraph.

Apologies if this was already thought-of and ruled out for some reason,
but...

Because all the scanning had been done in parallel to normal DB
activity, it needs to verify that all those blocks are still empty.

Would it be possible to use the FSM to figure out if things have changed
since the last scan..? Does that scan update the FSM, which would then
be updated by another backend in the event that it decided to write
something there? Or do we consider the FSM to be completely
untrustworthy wrt this (and if so, I don't suppose there's any hope to
using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to
throwing away heap pages. Can we?

The notion of having to double-scan and the AccessExclusiveLock on the
relation are telling me this work-around, while completely possible,
isn't exactly ideal...

Under normal circumstances, with just a few pages to trim off the end,
this is no problem. Those pages were the last pages just scanned by this
very autovacuum, so they are found in the shared buffers anyway. All the
second scan does in that case is fetch those pages once more from shared
buffers to be 100% sure we are not truncating off new tuples. We
definitely need the AccessExclusiveLock to prevent someone from
extending the relation at the end between our check of the relation size
and the truncate. Fetching 50 empty blocks from the buffer cache while
at it isn't that big of a deal, and that is what it normally looks like.

The problem case this patch is dealing with is rolling window tables
that have experienced some bloat. The typical example is a log table
that has new data constantly added and the oldest data constantly purged
out. This data normally rotates through some blocks like a rolling
window. If for some reason (purging turned off, for example) this table
bloats by several GB and later shrinks back to its normal content, soon
all the used blocks are at the beginning of the heap and we find tens of
thousands of empty pages at the end. Only now does the second scan take
more than 1000 ms, and autovacuum is at risk of getting killed while at
it.

Since we have experienced this problem several times now on our
production systems, something clearly needs to be done. But IMHO it
doesn't happen often enough to take any risk here.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#4)
Re: autovacuum truncate exclusive lock round two

Jan Wieck <JanWieck@Yahoo.com> writes:

On 10/24/2012 10:46 PM, Stephen Frost wrote:

Would it be possible to use the FSM to figure out if things have changed
since the last scan..? Does that scan update the FSM, which would then
be updated by another backend in the event that it decided to write
something there? Or do we consider the FSM to be completely
untrustworthy wrt this (and if so, I don't suppose there's any hope to
using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to
throwing away heap pages. Can we?

No. Backends are under no obligation to update FSM for each individual
tuple insertion, and typically don't do so.

More to the point, you have to take AccessExclusiveLock *anyway*,
because this is interlocking not only against new insertions but plain
read-only seqscans: if a seqscan falls off the end of the table it will
be very unhappy. So I don't see where we'd buy anything by consulting
the FSM.

regards, tom lane

#6Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#5)
Re: autovacuum truncate exclusive lock round two

On 10/25/2012 9:45 AM, Tom Lane wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

On 10/24/2012 10:46 PM, Stephen Frost wrote:

Would it be possible to use the FSM to figure out if things have changed
since the last scan..? Does that scan update the FSM, which would then
be updated by another backend in the event that it decided to write
something there? Or do we consider the FSM to be completely
untrustworthy wrt this (and if so, I don't suppose there's any hope to
using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to
throwing away heap pages. Can we?

No. Backends are under no obligation to update FSM for each individual
tuple insertion, and typically don't do so.

More to the point, you have to take AccessExclusiveLock *anyway*,
because this is interlocking not only against new insertions but plain
read-only seqscans: if a seqscan falls off the end of the table it will
be very unhappy. So I don't see where we'd buy anything by consulting
the FSM.

Thank you.

One thing that I haven't mentioned yet is that with this patch, we could
actually insert a vacuum_delay_point() into the loop in
count_nondeletable_pages(). We no longer cling to the exclusive lock but
rather get out of the way as soon as somebody needs the table. Under
this condition we no longer need to do the second scan full bore.
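
For illustration only (this is not part of the posted patch), the loop
could then look roughly like this:

    while (blkno > vacrelstats->nonempty_pages)
    {
        CHECK_FOR_INTERRUPTS();

        /*
         * With the early-release logic in place we no longer cling to
         * the exclusive lock, so the usual cost-based throttling could
         * be applied here too.
         */
        vacuum_delay_point();

        blkno--;

        /* ... read the page and check whether it contains tuples ... */
    }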

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#7Stephen Frost
sfrost@snowman.net
In reply to: Jan Wieck (#4)
Re: autovacuum truncate exclusive lock round two

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

The problem case this patch is dealing with is rolling window tables
that experienced some bloat. The typical example is a log table,
that has new data constantly added and the oldest data constantly
purged out. This data normally rotates through some blocks like a
rolling window. If for some reason (purging turned off for example)
this table bloats by several GB and later shrinks back to its normal
content, soon all the used blocks are at the beginning of the heap
and we find tens of thousands of empty pages at the end. Only now
does the second scan take more than 1000ms and autovacuum is at risk
to get killed while at it.

My concern is that this could certainly also happen to a heavily updated
table in an OLTP type of environment where the requirement to take a
heavy lock to clean it up might prevent it from ever happening.. I was
simply hoping we could find a mechanism to lock just those pages we're
getting ready to nuke rather than the entire relation. Perhaps we can
consider how to make those changes alongside of changes to eliminate or
reduce the extent locking that has been painful (for me at least) when
doing massive parallel loads into a table.

Since we have experienced this problem several times now on our
production systems, something clearly needs to be done. But IMHO it
doesn't happen often enough to take any risk here.

I'm not advocating a 'do-nothing' approach; I was just looking for
another option that might allow this work to happen on the heap in
parallel with regular access. Since we haven't got any way to do that
currently, +1 for moving forward with this, as it clearly improves the
current situation.

Thanks,

Stephen

#8Jan Wieck
JanWieck@Yahoo.com
In reply to: Stephen Frost (#7)
Re: autovacuum truncate exclusive lock round two

On 10/25/2012 10:12 AM, Stephen Frost wrote:

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

The problem case this patch is dealing with is rolling window tables
that experienced some bloat. The typical example is a log table,
that has new data constantly added and the oldest data constantly
purged out. This data normally rotates through some blocks like a
rolling window. If for some reason (purging turned off for example)
this table bloats by several GB and later shrinks back to its normal
content, soon all the used blocks are at the beginning of the heap
and we find tens of thousands of empty pages at the end. Only now
does the second scan take more than 1000ms and autovacuum is at risk
to get killed while at it.

My concern is that this could certainly also happen to a heavily updated
table in an OLTP type of environment where the requirement to take a
heavy lock to clean it up might prevent it from ever happening.. I was
simply hoping we could find a mechanism to lock just those pages we're
getting ready to nuke rather than the entire relation. Perhaps we can
consider how to make those changes alongside of changes to eliminate or
reduce the extent locking that has been painful (for me at least) when
doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated table.
Preventing not only the cleanup but the following ANALYZE as well is
precisely what happens. There may be multiple ways to get into this
situation, but once you're there the symptoms are the same. Vacuum fails
to truncate the table, causing a 1 second hiccup every minute while
vacuum holds the exclusive lock until the deadlock detection code of
another transaction kills it.

My patch doesn't change the logic of how we ensure that we don't zap any
data by accident with the truncate, and Tom's comments suggest we should
stick to it. It only makes autovacuum check frequently whether the
AccessExclusiveLock is actually blocking anyone and then get out of the
way.

I would rather like to discuss any ideas for how to do all this without
3 new GUCs.

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster than
it is now and still guarantees that autovacuum will make some progress
with the truncate.

The other two GUCs control how often and how fast autovacuum tries to
acquire the exclusive lock in the first place. Since we actively release
the lock *because someone needs it*, it is pretty much guaranteed that
the immediate next lock attempt fails. We do a ConditionalLockRelation()
on purpose because there is a chance of deadlock. The current code only
tries one lock attempt and gives up immediately. I don't know what to
derive a good value for the number of retries from, but the nap time in
between tries could be a hardcoded 20 ms or the cost-based vacuum nap
time (which defaults to 20 ms).
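
To make that concrete, here is a sketch of how the three values could be
derived from existing settings instead of new GUCs (illustrative only;
DeadlockTimeout and VacuumCostDelay are the existing GUC variables, both
in milliseconds):

    int     check_interval_ms = Max(DeadlockTimeout / 10, 10);
    int     retry_nap_ms = VacuumCostDelay > 0 ? VacuumCostDelay : 20;
    int     lock_retries = 50;      /* still an open question */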

Any other ideas are welcome.

Thanks,
Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jan Wieck (#8)
Re: autovacuum truncate exclusive lock round two

Jan Wieck wrote:

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms)
as the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster
than it is now and still guarantees that autovacuum will make some
progress with the truncate.

So you would be calling GetCurrentTimestamp() continuously? Since you
mentioned adding a vacuum delay point I wonder if it would make sense to
test for lockers each time it would consider going to sleep, instead.
(One hazard to keep in mind is the case where no vacuum delay is
configured.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Jan Wieck
JanWieck@Yahoo.com
In reply to: Alvaro Herrera (#9)
Re: autovacuum truncate exclusive lock round two

On 10/25/2012 12:24 PM, Alvaro Herrera wrote:

Jan Wieck wrote:

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms)
as the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster
than it is now and still guarantees that autovacuum will make some
progress with the truncate.

So you would be calling GetCurrentTimestamp() continuously? Since you
mentioned adding a vacuum delay point I wonder if it would make sense to
test for lockers each time it would consider going to sleep, instead.
(One hazard to keep in mind is the case where no vacuum delay is
configured.)

Depends on your definition of "continuously". If doing one
INSTR_TIME_SET_CURRENT(), which on Unix boils down to a gettimeofday(),
every 32 ReadBufferExtended() calls counts as continuously, then yes.

Adding a vacuum_delay_point() is something we should consider. However,
the vacuum_delay_point() call simply naps when enough cost has been
racked up. You don't know if the next call will nap or not. We would
have to extend that functionality with some vacuum_delay_would_nap()
call to do what you suggest.
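
Such a function does not exist today. A sketch of what it might look
like, based on the same condition vacuum_delay_point() uses to decide
whether to sleep (illustrative only):

    static bool
    vacuum_delay_would_nap(void)
    {
        /* mirror the nap condition inside vacuum_delay_point() */
        return VacuumCostActive && !InterruptPending &&
               VacuumCostBalance >= VacuumCostLimit;
    }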

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#11Amit Kapila
amit.kapila@huawei.com
In reply to: Jan Wieck (#8)
Re: autovacuum truncate exclusive lock round two

On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:

On 10/25/2012 10:12 AM, Stephen Frost wrote:

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

The problem case this patch is dealing with is rolling window tables
that experienced some bloat. The typical example is a log table,
that has new data constantly added and the oldest data constantly
purged out. This data normally rotates through some blocks like a
rolling window. If for some reason (purging turned off for example)
this table bloats by several GB and later shrinks back to its normal
content, soon all the used blocks are at the beginning of the heap
and we find tens of thousands of empty pages at the end. Only now
does the second scan take more than 1000ms and autovacuum is at risk
to get killed while at it.

My concern is that this could certainly also happen to a heavily

updated

table in an OLTP type of environment where the requirement to take a
heavy lock to clean it up might prevent it from ever happening.. I

was

simply hoping we could find a mechanism to lock just those pages we're
getting ready to nuke rather than the entire relation. Perhaps we can
consider how to make those changes alongside of changes to eliminate

or

reduce the extent locking that has been painful (for me at least) when
doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated table.
Preventing not only the clean up, but the following ANALYZE as well is
precisely what happens. There may be multiple ways how to get into this
situation, but once you're there the symptoms are the same. Vacuum fails
to truncate it and causing a 1 second hiccup every minute, while vacuum
is holding the exclusive lock until the deadlock detection code of
another transaction kills it.

My patch doesn't change the logic how we ensure that we don't zap any
data by accident with the truncate and Tom's comments suggest we should
stick to it. It only makes autovacuum check frequently if the
AccessExclusiveLock is actually blocking anyone and then get out of the
way.

I would rather like to discuss any ideas how to do all this without 3
new GUCs.

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster than
it is now and still guarantees that autovacuum will make some progress
with the truncate.

One other way could be to check after every few pages for a conflicting
lock request.

The other two GUCs control how often and how fast autovacuum tries to
acquire the exclusive lock in the first place. Since we actively release
the lock *because someone needs it* it is pretty much guaranteed that
the immediate next lock attempt fails. We on purpose do a
ConditionalLockRelation() because there is a chance to deadlock. The
current code only tries one lock attempt and gives up immediately. I
don't know from what to derive a good value for how long to retry,

Can't we do something like: after the nap, check for the conditional
lock, and if we didn't get it, then take the lock unconditionally?
The reason why, with your implementation, it might be okay to take the
lock unconditionally after one try is that after every few pages, or
after a short time, it will release the lock anyway if there is any
waiter.

but
the nap time in between tries could be a hardcoded 20ms or using the
cost based vacuum nap time (which defaults to 20ms).

I think using the cost-based vacuum nap time or the default value is good.

Adding new parameters carries user/administrator overhead; it is always
better if the database can decide this intelligently by itself.
However, if you feel these are parameters that can vary based on
different kinds of usage, then I think it is better to expose them to
users through configuration parameters.

With Regards,
Amit Kapila.

#12Jan Wieck
JanWieck@Yahoo.com
In reply to: Amit Kapila (#11)
Re: autovacuum truncate exclusive lock round two

On 10/26/2012 1:29 AM, Amit Kapila wrote:

On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:

On 10/25/2012 10:12 AM, Stephen Frost wrote:

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

The problem case this patch is dealing with is rolling window tables
that experienced some bloat. The typical example is a log table,
that has new data constantly added and the oldest data constantly
purged out. This data normally rotates through some blocks like a
rolling window. If for some reason (purging turned off for example)
this table bloats by several GB and later shrinks back to its normal
content, soon all the used blocks are at the beginning of the heap
and we find tens of thousands of empty pages at the end. Only now
does the second scan take more than 1000ms and autovacuum is at risk
to get killed while at it.

My concern is that this could certainly also happen to a heavily

updated

table in an OLTP type of environment where the requirement to take a
heavy lock to clean it up might prevent it from ever happening.. I

was

simply hoping we could find a mechanism to lock just those pages we're
getting ready to nuke rather than the entire relation. Perhaps we can
consider how to make those changes alongside of changes to eliminate

or

reduce the extent locking that has been painful (for me at least) when
doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated table.
Preventing not only the clean up, but the following ANALYZE as well is
precisely what happens. There may be multiple ways how to get into this
situation, but once you're there the symptoms are the same. Vacuum fails
to truncate it and causing a 1 second hiccup every minute, while vacuum
is holding the exclusive lock until the deadlock detection code of
another transaction kills it.

My patch doesn't change the logic how we ensure that we don't zap any
data by accident with the truncate and Tom's comments suggest we should
stick to it. It only makes autovacuum check frequently if the
AccessExclusiveLock is actually blocking anyone and then get out of the
way.

I would rather like to discuss any ideas how to do all this without 3
new GUCs.

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster than
it is now and still guarantees that autovacuum will make some progress
with the truncate.

One other way could be to check after every few pages for a conflicting
lock request.

How is this any different from what my patch does? Did you even look at
the code?

The other two GUCs control how often and how fast autovacuum tries to
acquire the exclusive lock in the first place. Since we actively release
the lock *because someone needs it* it is pretty much guaranteed that
the immediate next lock attempt fails. We on purpose do a
ConditionalLockRelation() because there is a chance to deadlock. The
current code only tries one lock attempt and gives up immediately. I
don't know from what to derive a good value for how long to retry,

Can't we do something like, after nap check for conditional lock and if it
didn't get
then get lock unconditionally.

No, we cannot. This is also well documented in the code.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#13Amit Kapila
amit.kapila@huawei.com
In reply to: Amit Kapila (#11)
Re: autovacuum truncate exclusive lock round two

On Friday, October 26, 2012 10:59 AM Amit Kapila wrote:

On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:

On 10/25/2012 10:12 AM, Stephen Frost wrote:

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

The problem case this patch is dealing with is rolling window

tables

that experienced some bloat. The typical example is a log table,
that has new data constantly added and the oldest data constantly
purged out. This data normally rotates through some blocks like a
rolling window. If for some reason (purging turned off for example)
this table bloats by several GB and later shrinks back to its

normal

content, soon all the used blocks are at the beginning of the heap
and we find tens of thousands of empty pages at the end. Only now
does the second scan take more than 1000ms and autovacuum is at

risk

to get killed while at it.

My concern is that this could certainly also happen to a heavily

updated

table in an OLTP type of environment where the requirement to take a
heavy lock to clean it up might prevent it from ever happening.. I

was

simply hoping we could find a mechanism to lock just those pages

we're

getting ready to nuke rather than the entire relation. Perhaps we

can

consider how to make those changes alongside of changes to eliminate

or

reduce the extent locking that has been painful (for me at least)

when

doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated

table.

Preventing not only the clean up, but the following ANALYZE as well is
precisely what happens. There may be multiple ways how to get into

this

situation, but once you're there the symptoms are the same. Vacuum

fails

to truncate it and causing a 1 second hiccup every minute, while

vacuum

is holding the exclusive lock until the deadlock detection code of
another transaction kills it.

My patch doesn't change the logic how we ensure that we don't zap any
data by accident with the truncate and Tom's comments suggest we

should

stick to it. It only makes autovacuum check frequently if the
AccessExclusiveLock is actually blocking anyone and then get out of

the

way.

I would rather like to discuss any ideas how to do all this without 3
new GUCs.

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster

than

it is now and still guarantees that autovacuum will make some progress
with the truncate.

One other way could be to check after every few pages for a
conflicting
lock request.

The other two GUCs control how often and how fast autovacuum tries to
acquire the exclusive lock in the first place. Since we actively

release

the lock *because someone needs it* it is pretty much guaranteed that
the immediate next lock attempt fails. We on purpose do a
ConditionalLockRelation() because there is a chance to deadlock. The
current code only tries one lock attempt and gives up immediately. I
don't know from what to derive a good value for how long to retry,

Can't we do something like, after nap check for conditional lock and
if it
didn't get
then get lock unconditionally.
The reason why after your implementation it might be okay to have lock
unconditionally after one try is that
anyway after every few pages or after small time, it will release the
lock
if there is any waiter.

I am sorry; at this point trying to take an unconditional exclusive lock
can lead to deadlock, so the above is not possible.

but
the nap time in between tries could be a hardcoded 20ms or using the
cost based vacuum nap time (which defaults to 20ms).

I think using cost based vacuum nap time or default value is good.

Adding new parameters might have user/administrator overhead, it is
always
better if it can be intelligently decided by database itself.
However if you feel these are parameters which can vary based on
different
kind of usage, then I think it is better to expose it through
configuration
parameters to users.

With Regards,
Amit Kapila.

#14Amit Kapila
amit.kapila@huawei.com
In reply to: Jan Wieck (#12)
Re: autovacuum truncate exclusive lock round two

On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:

On 10/26/2012 1:29 AM, Amit Kapila wrote:

On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:

On 10/25/2012 10:12 AM, Stephen Frost wrote:

Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:

The problem case this patch is dealing with is rolling window

tables

that experienced some bloat. The typical example is a log table,
that has new data constantly added and the oldest data constantly
purged out. This data normally rotates through some blocks like a
rolling window. If for some reason (purging turned off for

example)

this table bloats by several GB and later shrinks back to its

normal

content, soon all the used blocks are at the beginning of the heap
and we find tens of thousands of empty pages at the end. Only now
does the second scan take more than 1000ms and autovacuum is at

risk

to get killed while at it.

My concern is that this could certainly also happen to a heavily

updated

table in an OLTP type of environment where the requirement to take

a

heavy lock to clean it up might prevent it from ever happening.. I

was

simply hoping we could find a mechanism to lock just those pages

we're

getting ready to nuke rather than the entire relation. Perhaps we

can

consider how to make those changes alongside of changes to

eliminate

or

reduce the extent locking that has been painful (for me at least)

when

doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated

table.

Preventing not only the clean up, but the following ANALYZE as well

is

precisely what happens. There may be multiple ways how to get into

this

situation, but once you're there the symptoms are the same. Vacuum

fails

to truncate it and causing a 1 second hiccup every minute, while

vacuum

is holding the exclusive lock until the deadlock detection code of
another transaction kills it.

My patch doesn't change the logic how we ensure that we don't zap any
data by accident with the truncate and Tom's comments suggest we

should

stick to it. It only makes autovacuum check frequently if the
AccessExclusiveLock is actually blocking anyone and then get out of

the

way.

I would rather like to discuss any ideas how to do all this without 3
new GUCs.

In the original code, the maximum delay that autovacuum can cause by
holding the exclusive lock is one deadlock_timeout (default 1s). It
would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
the interval to check for a conflicting lock request. For another
transaction that needs to access the table this is 10 times faster

than

it is now and still guarantees that autovacuum will make some

progress

with the truncate.

One other way could be to check after every few pages for a

conflicting

lock request.

How is this any different from what my patch does?

The difference is that the patch checks for waiters by using two
parameters, autovacuum_truncate_lock_check and blkno % 32, and what I
had mentioned was to check based only on blkno.
Will it cost too much if we directly check for waiters after every 32
(or any feasible number of) blocks?

Did you even look at the code?

I hadn't looked at the code when I replied to your previous mail, but
now I have checked it.

With Regards,
Amit Kapila.

#15Jan Wieck
JanWieck@Yahoo.com
In reply to: Amit Kapila (#14)
Re: autovacuum truncate exclusive lock round two

On 10/26/2012 6:35 AM, Amit Kapila wrote:

On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:

On 10/26/2012 1:29 AM, Amit Kapila wrote:

One other way could be to check after every few pages for a

conflicting

lock request.

How is this any different from what my patch does?

The difference is that in the patch it checks for waiters by using 2
parameters autovacuum_truncate_lock_check and blkno%32 and what I
had mentioned was to check only based on blkno.
Will it effect too much if we directly check for waiters after every 32
(any feasible number) blocks?

The blkno % 32 is there to avoid doing the gettimeofday() call too
often. But relying on the block number alone is IMHO not a good idea. It
would have to be a number small enough that even on a busy system, when
the pages have to be read from disk, vacuum checks for and releases the
lock quickly, but large enough that it doesn't create a significant
number of spinlock calls in the lmgr. We would end up with another
parameter, a number of blocks, that is a lot harder to estimate a good
value for.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#16Robert Haas
robertmhaas@gmail.com
In reply to: Jan Wieck (#1)
Re: autovacuum truncate exclusive lock round two

On Wed, Oct 24, 2012 at 4:20 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

This patch does introduce three new postgresql.conf parameters, which I
would be happy to get rid of if we could derive them from something else.
Something based on the deadlock timeout may be possible.

autovacuum_truncate_lock_check = 100ms # how frequent to check
# for conflicting locks
autovacuum_truncate_lock_retry = 50 # how often to try acquiring
# the exclusive lock
autovacuum_truncate_lock_wait = 20ms # nap in between attempts

+1 for this general approach.

As you suggested downthread, I think that hard-coding
autovacuum_truncate_lock_check to one-tenth of the deadlock timeout
should be just fine. For the other two parameters, I doubt we need to
make them configurable at all. It's not exactly clear what to set
them to, but it does seem clear that the down side of setting them
incorrectly isn't very much as long as the defaults are roughly sane.
Personally, I'd be inclined to retry less frequently but over a
slightly longer time period - say twenty retries, one after every
100ms. But I wouldn't be upset if we settled on what you've got here,
either. We just don't want to let the total time we spend waiting for
the lock get too long, because that means pinning down an auto-vacuum
worker that might be critically needed elsewhere. So the product of
autovacuum_truncate_lock_retry and autovacuum_truncate_lock_wait
probably should not be more than a couple of seconds.
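
(For comparison: the posted defaults allow at most 50 * 20ms = 1s of
waiting for the lock, while twenty retries at 100ms intervals would
allow up to 20 * 100ms = 2s.)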

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Jan Wieck (#1)
Re: autovacuum truncate exclusive lock round two

Jan Wieck <JanWieck@Yahoo.com> writes:

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
periodically check, if there is a conflicting lock request waiting. If not,
keep going. If there is a waiter, truncate the relation to the point checked
thus far, release the AccessExclusiveLock, then loop back to where we
acquire this lock in the first place and continue checking/truncating.

I think that maybe we could just bail out after releasing the
AccessExclusiveLock and trust autovacuum to get back to truncating that
relation later. That would allow removing 2 of the 3 GUCs below:

autovacuum_truncate_lock_check = 100ms # how frequent to check
# for conflicting locks

This is the one remaining. Could we maybe check for a lock conflict
after every page moved backward, or some multiple thereof? The goal
would be to ensure that progress is made while also being aware of
concurrent activity, à la CHECK_FOR_INTERRUPTS().

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dimitri Fontaine (#17)
Re: autovacuum truncate exclusive lock round two

Dimitri Fontaine wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
periodically check, if there is a conflicting lock request waiting. If not,
keep going. If there is a waiter, truncate the relation to the point checked
thus far, release the AccessExclusiveLock, then loop back to where we
acquire this lock in the first place and continue checking/truncating.

I think that maybe we could just bail out after releasing the
AccessExclusiveLock and trust autovacuum to get back to truncating that
relation later.

That doesn't work, because the truncating code is not reached unless
vacuuming actually took place. So if you interrupt it, it will just not
get called again later. Maybe we could have autovacuum somehow invoke
that separately, but that would require that the fact that truncation
was aborted is kept track of somewhere.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19Amit Kapila
amit.kapila@huawei.com
In reply to: Alvaro Herrera (#18)
Re: autovacuum truncate exclusive lock round two

On Friday, November 16, 2012 4:09 AM Alvaro Herrera wrote:

Dimitri Fontaine wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
periodically check, if there is a conflicting lock request waiting. If not,
keep going. If there is a waiter, truncate the relation to the point checked
thus far, release the AccessExclusiveLock, then loop back to where we
acquire this lock in the first place and continue checking/truncating.

I think that maybe we could just bail out after releasing the
AccessExclusiveLock and trust autovacuum to get back to truncating that
relation later.

That doesn't work, because the truncating code is not reached unless
vacuuming actually took place. So if you interrupt it, it will just not
get called again later. Maybe we could have autovacuum somehow invoke
that separately, but that would require that the fact that truncation
was aborted is kept track of somewhere.

Won't it have a chance to be handled the next time vacuum triggers due
to updates/deletes on some other pages?

OTOH, maybe the same thing happens again next time and the truncate
still doesn't complete. So I think it's better to complete it the first
time, but perhaps using some heuristic wait-and-retry timing rather
than configuration variables.

With Regards,
Amit Kapila

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jan Wieck (#15)
Re: autovacuum truncate exclusive lock round two

Jan,

Are you posting an updated patch?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Kevin Grittner
kgrittn@mail.com
In reply to: Alvaro Herrera (#20)
Re: autovacuum truncate exclusive lock round two

Alvaro Herrera wrote:

Are you posting an updated patch?

Well, there wasn't exactly a consensus on what should change, so I'll
throw some initial review comments out even though I still need to
check some things in the code and do more testing.

The patch applied cleanly, compiled without warnings, and passed all
tests in `make check-world`.

The previous discussion seemed to me to achieve consensus on the need
for the feature, but a general dislike for adding so many new GUC
settings to configure it. I concur on that. I agree with the feelings
of others that just using deadlock_timeout / 10 (probably clamped to
a minimum of 10ms) would be good instead of
autovacuum_truncate_lock_check. I agree that the values of the other
two settings probably aren't too critical as long as they are fairly
reasonable, which I would define as being 20ms to 100ms, with
retries lasting no more than 2 seconds.  I'm inclined to suggest a
total time of deadlock_timeout * 2 clamped to 2 seconds, but maybe
that is over-engineering it.

There was also a mention of possibly inserting a vacuum_delay_point()
call outside of the AccessExclusiveLock. I don't feel strongly about
it, but if the page scanning cost can be tracked easily, I guess it
is better to do that.

Other than simplifying the code based on eliminating the new GUCs,
the coding issues I found were:

TRUE and FALSE were used as literals.  Since true and false are used
about 10 times as often and current code in the modified files is
mixed, I would recommend the lowercase form. We decided it wasn't
worth the trouble to convert the minority usage over, but I don't see
any reason to add new instances of the minority case.

In LockHasWaiters() the partition lock is acquired using
LW_EXCLUSIVE. I don't see why that can't be LW_SHARED. Was there a
reason for this, or was it just not changed after copy/paste?

I still need to review the timing calls, since I'm not familiar with
them so it wasn't immediately obvious to me whether they were being
used correctly. I have no reason to believe that they aren't, but
feel I should check.

Also, I want to do another pass looking just for off-by-one errors on
blkno. Again, I have no reason to believe that there is a problem; it
just seems like it would be a particularly easy type of mistake to
make and miss when a key structure has this field:

 BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */

And I want to test more.

But if a new version could be created based on the above, that would
be great. Chances seem relatively slim at this point that there will
be much else to change, if anything.

-Kevin

#22Kevin Grittner
kgrittn@mail.com
In reply to: Kevin Grittner (#21)
Re: autovacuum truncate exclusive lock round two

Kevin Grittner wrote:

I still need to review the timing calls, since I'm not familiar
with them so it wasn't immediately obvious to me whether they
were being used correctly. I have no reason to believe that they
aren't, but feel I should check.

It seems odd to me that assignment of one instr_time to another is
done with INSTR_TIME_SET_ZERO() of the target followed by
INSTR_TIME_ADD() with the target and the source. It seems to me
that simple assignment would be more readable, and I can't see any
down side.

Why shouldn't:

   INSTR_TIME_SET_ZERO(elapsed);
   INSTR_TIME_ADD(elapsed, currenttime);
   INSTR_TIME_SUBTRACT(elapsed, starttime);

instead be?:

   elapsed = currenttime;
   INSTR_TIME_SUBTRACT(elapsed, starttime);

And why shouldn't:

   INSTR_TIME_SET_ZERO(starttime);
   INSTR_TIME_ADD(starttime, currenttime);

instead be?:

   starttime = currenttime;

Resetting starttime this way seems especially odd.

Also, I want to do another pass looking just for off-by-one
errors on blkno. Again, I have no reason to believe that there is
a problem; it just seems like it would be a particularly easy
type of mistake to make and miss when a key structure has this
field:

  BlockNumber nonempty_pages;
     /* actually, last nonempty page + 1 */

No problems found with that.

And I want to test more.

The patch worked as advertised in all my tests, but I became
uncomfortable with the games being played with the last autovacuum
timestamp and the count of dead tuples. Sure, that causes
autovacuum to kick back in until it has dealt with the truncation,
but it makes it hard for a human looking at the stat views to see
what's going on, and I'm not sure it wouldn't lead to bad plans due
to stale statistics.

Personally, I would rather see us add a boolean to indicate that
autovacuum was needed (regardless of the math on the other columns)
and use that to control the retries -- leaving the other columns
free to reflect reality.

-Kevin


#23Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#22)
Re: autovacuum truncate exclusive lock round two

On 11/28/2012 3:33 PM, Kevin Grittner wrote:

Kevin Grittner wrote:

I still need to review the timing calls, since I'm not familiar
with them so it wasn't immediately obvious to me whether they
were being used correctly. I have no reason to believe that they
aren't, but feel I should check.

It seems odd to me that assignment of one instr_time to another is
done with INSTR_TIME_SET_ZERO() of the target followed by
INSTR_TIME_ADD() with the target and the source. It seems to me
that simple assignment would be more readable, and I can't see any
down side.

Why shouldn't:

INSTR_TIME_SET_ZERO(elapsed);
INSTR_TIME_ADD(elapsed, currenttime);
INSTR_TIME_SUBTRACT(elapsed, starttime);

instead be?:

elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);

And why shouldn't:

INSTR_TIME_SET_ZERO(starttime);
INSTR_TIME_ADD(starttime, currenttime);

instead be?:

starttime = currenttime;

Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is

starttime = currenttime;

portable if those are structs?

Also, I want to do another pass looking just for off-by-one
errors on blkno. Again, I have no reason to believe that there is
a problem; it just seems like it would be a particularly easy
type of mistake to make and miss when a key structure has this
field:

BlockNumber nonempty_pages;
/* actually, last nonempty page + 1 */

No problems found with that.

And I want to test more.

The patch worked as advertised in all my tests, but I became
uncomfortable with the games being played with the last autovacuum
timestamp and the count of dead tuples. Sure, that causes
autovacuum to kick back in until it has dealt with the truncation,
but it makes it hard for a human looking at the stat views to see
what's going on, and I'm not sure it wouldn't lead to bad plans due
to stale statistics.

Personally, I would rather see us add a boolean to indicate that
autovacuum was needed (regardless of the math on the other columns)
and use that to control the retries -- leaving the other columns
free to reflect reality.

You mean to extend the stats by yet another column? The thing is that
this whole case happens rarely. We see this every other month or so and
only on a rolling window table after it got severely bloated due to some
abnormal use (purging disabled for some operational reason). The whole
situation resolves itself after a few minutes to maybe an hour or two.

Personally I don't think living with a few wrong stats on one table for
that time is so bad that it warrants changing that much more code.

Lowercasing TRUE/FALSE will be done.

If the LW_SHARED is enough in LockHasWaiters(), that's fine too.

I think we have a consensus that the check interval should be derived
from deadlock_timeout/10 clamped to 10ms. I'm going to remove that GUC.

About the other two, I'm just not sure. Maybe use half of the lock
waiter check interval as the lock retry interval, again clamped to
10ms, and simply leave one GUC that controls how long autovacuum
should retry this. I'm not too worried about the retry interval.
However, the problem with the overall retry duration is that this value
very much depends on the usage pattern of the database. If long running
transactions (like >5 seconds) are the norm, then a hard-coded value of
2 seconds before autovacuum gives up isn't going to help much.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#23)
Re: autovacuum truncate exclusive lock round two

Jan Wieck <JanWieck@Yahoo.com> writes:

On 11/28/2012 3:33 PM, Kevin Grittner wrote:

Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
starttime = currenttime;
portable if those are structs?

Sure. We rely on struct assignment in lots of places.

regards, tom lane


#25Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#24)
Re: autovacuum truncate exclusive lock round two

On 11/29/2012 9:46 AM, Tom Lane wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

On 11/28/2012 3:33 PM, Kevin Grittner wrote:

Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
starttime = currenttime;
portable if those are structs?

Sure. We rely on struct assignment in lots of places.

Will be done then.

Thanks,
Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#26Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#22)
1 attachment(s)
Re: autovacuum truncate exclusive lock round two

Attached is a new patch that addresses most of the points raised in
discussion before.

1) Most of the configuration variables are derived from deadlock_timeout
now. The "check for conflicting lock request" interval is
deadlock_timeout/10, clamped to a minimum of 10ms. The "try to acquire
exclusive lock" interval is deadlock_timeout/20, also clamped to a
minimum of 10ms. The only GUC variable remaining is
autovacuum_truncate_lock_try=2000ms, with a range from 0 (just try once)
to 20000ms. (The derivation is shown in condensed code form below.)

I'd like to point out that this is a significant change in functionality
as without the config option for the check interval, there is no longer
any possibility to disable the call to LockHasWaiters() and return to
the original (deadlock code kills autovacuum) behavior.

2) The partition lock in LockHasWaiters() was lowered to LW_SHARED. The
LW_EXCLUSIVE was indeed a copy/paste result.

3) The instr_time handling was simplified as suggested.

4) Lower case TRUE/FALSE.
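
In code form, the derivation in 1) boils down to this (condensed from
the attached diff):

    autovacuum_truncate_lock_check = Max(DeadlockTimeout / 10, 10);    /* ms */
    autovacuum_truncate_lock_wait  = Max(DeadlockTimeout / 20, 10);    /* ms */
    autovacuum_truncate_lock_retry = autovacuum_truncate_lock_try /
                                     autovacuum_truncate_lock_wait;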

I did not touch the part about suppressing the stats and the ANALYZE
step of "auto vacuum+analyze". The situation is no different today. When
the deadlock code kills autovacuum, stats aren't updated either. And
this patch is meant to cause autovacuum to finish the truncate in a few
minutes or hours, so that the situation fixes itself.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachments:

autovacuum-truncate-lock-2.difftext/x-patch; name=autovacuum-truncate-lock-2.diffDownload
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c9253a9..e8f88ad 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 52,62 ****
--- 52,64 ----
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/lmgr.h"
+ #include "storage/proc.h"
  #include "utils/lsyscache.h"
  #include "utils/memutils.h"
  #include "utils/pg_rusage.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "portability/instr_time.h"
  
  
  /*
*************** typedef struct LVRelStats
*** 103,108 ****
--- 105,111 ----
  	ItemPointer dead_tuples;	/* array of ItemPointerData */
  	int			num_index_scans;
  	TransactionId latestRemovedXid;
+ 	bool		lock_waiter_detected;
  } LVRelStats;
  
  
*************** static TransactionId FreezeLimit;
*** 114,119 ****
--- 117,126 ----
  
  static BufferAccessStrategy vac_strategy;
  
+ static int autovacuum_truncate_lock_check;
+ static int autovacuum_truncate_lock_retry;
+ static int autovacuum_truncate_lock_wait;
+ 
  
  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 193,198 ****
--- 200,207 ----
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
  	vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples;
  	vacrelstats->num_index_scans = 0;
+ 	vacrelstats->pages_removed = 0;
+ 	vacrelstats->lock_waiter_detected = false;
  
  	/* Open all indexes of the relation */
  	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 259,268 ****
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/* report results to the stats collector, too */
! 	pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
--- 268,285 ----
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/*
! 	 * report results to the stats collector, too.
! 	 * An early terminated lazy_truncate_heap attempt 
! 	 * suppresses the message and also cancels the
! 	 * execution of ANALYZE, if that was ordered.
! 	 */
! 	if (!vacrelstats->lock_waiter_detected)
! 		pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
+ 	else
+ 		vacstmt->options &= ~VACOPT_ANALYZE;
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
*************** lazy_truncate_heap(Relation onerel, LVRe
*** 1255,1334 ****
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * We need full exclusive lock on the relation in order to do truncation.
! 	 * If we can't get it, give up rather than waiting --- we don't want to
! 	 * block other backends, and we don't want to deadlock (which is quite
! 	 * possible considering we already hold a lower-grade lock).
  	 */
! 	if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
! 		return;
  
  	/*
! 	 * Now that we have exclusive lock, look to see if the rel has grown
! 	 * whilst we were vacuuming with non-exclusive lock.  If so, give up; the
! 	 * newly added pages presumably contain non-deletable tuples.
  	 */
! 	new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 	if (new_rel_pages != old_rel_pages)
  	{
  		/*
! 		 * Note: we intentionally don't update vacrelstats->rel_pages with the
! 		 * new rel size here.  If we did, it would amount to assuming that the
! 		 * new pages are empty, which is unlikely.	Leaving the numbers alone
! 		 * amounts to assuming that the new pages have the same tuple density
! 		 * as existing ones, which is less unlikely.
  		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Scan backwards from the end to verify that the end pages actually
! 	 * contain no tuples.  This is *necessary*, not optional, because other
! 	 * backends could have added tuples to these pages whilst we were
! 	 * vacuuming.
! 	 */
! 	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
  
! 	if (new_rel_pages >= old_rel_pages)
! 	{
! 		/* can't do anything after all */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Okay to truncate.
! 	 */
! 	RelationTruncate(onerel, new_rel_pages);
  
! 	/*
! 	 * We can release the exclusive lock as soon as we have truncated.	Other
! 	 * backends can't safely access the relation until they have processed the
! 	 * smgr invalidation that smgrtruncate sent out ... but that should happen
! 	 * as part of standard invalidation processing once they acquire lock on
! 	 * the relation.
! 	 */
! 	UnlockRelation(onerel, AccessExclusiveLock);
  
! 	/*
! 	 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 	 * also touching reltuples, since the tuple count wasn't changed by the
! 	 * truncation.
! 	 */
! 	vacrelstats->rel_pages = new_rel_pages;
! 	vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
  
! 	ereport(elevel,
! 			(errmsg("\"%s\": truncated %u to %u pages",
! 					RelationGetRelationName(onerel),
! 					old_rel_pages, new_rel_pages),
! 			 errdetail("%s.",
! 					   pg_rusage_show(&ru0))));
  }
  
  /*
--- 1272,1411 ----
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
+ 	int			lock_retry;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * Calculate how frequently we check for a conflicting lock
! 	 * request and how hard we try to (re)acquire the exclusive
! 	 * lock for scanning the supposedly empty blocks at the end.
! 	 * With the default deadlock_timeout of 1000ms, we check every
! 	 * 100ms if we should release a lock we hold and we try for
! 	 * autovacuum_truncate_lock_try (2000ms default) in 50ms
! 	 * intervals to (re)acquire it.
  	 */
! 	autovacuum_truncate_lock_check = DeadlockTimeout / 10;
! 	if (autovacuum_truncate_lock_check < 10)
! 		autovacuum_truncate_lock_check = 10;
! 	autovacuum_truncate_lock_wait = DeadlockTimeout / 20;
! 	if (autovacuum_truncate_lock_wait < 10)
! 		autovacuum_truncate_lock_wait = 10;
! 	autovacuum_truncate_lock_retry = autovacuum_truncate_lock_try /
! 									 autovacuum_truncate_lock_wait;
  
  	/*
! 	 * Loop until no more truncating can be done.
  	 */
! 	do
  	{
  		/*
! 		 * We need full exclusive lock on the relation in order to do 
! 		 * truncation.
! 		 * If we can't get it, give up rather than waiting --- we don't want to
! 		 * block other backends, and we don't want to deadlock (which is quite
! 		 * possible considering we already hold a lower-grade lock).
  		 */
! 		vacrelstats->lock_waiter_detected = false;
! 		lock_retry = 0;
! 		while (true)
! 		{
! 			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
! 				break;
  
! 			if (autovacuum_truncate_lock_retry == 0)
! 				return;
  
! 			/*
! 			 * Check for interrupts while trying to (re-)acquire
! 			 * the exclusive lock.
! 			 */
! 			CHECK_FOR_INTERRUPTS();
  
! 			if (++lock_retry > autovacuum_truncate_lock_retry)
! 			{
! 				/*
! 				 * We failed to establish the lock in the specified
! 				 * number of retries. This means we give up truncating.
! 				 * Suppress the ANALYZE step. Doing an ANALYZE at
! 				 * this point will reset the dead_tuple_count in the
! 				 * stats collector, so we will not get called by the
! 				 * autovacuum launcher again to do the truncate.
! 				 */
! 				vacrelstats->lock_waiter_detected = true;
! 				return;
! 			}
  
! 			if (autovacuum_truncate_lock_wait > 0)
! 				pg_usleep((long)autovacuum_truncate_lock_wait * 1000);
! 		}
  
! 		/*
! 		 * Now that we have exclusive lock, look to see if the rel has grown
! 		 * whilst we were vacuuming with non-exclusive lock.  If so, give up;
! 		 * the newly added pages presumably contain non-deletable tuples.
! 		 */
! 		new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 		if (new_rel_pages != old_rel_pages)
! 		{
! 			/*
! 			 * Note: we intentionally don't update vacrelstats->rel_pages 
! 			 * with the new rel size here.  If we did, it would amount to 
! 			 * assuming that the new pages are empty, which is unlikely.
! 			 * Leaving the numbers alone amounts to assuming that the new 
! 			 * pages have the same tuple density as existing ones, which 
! 			 * is less unlikely.
! 			 */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
  
! 		/*
! 		 * Scan backwards from the end to verify that the end pages actually
! 		 * contain no tuples.  This is *necessary*, not optional, because other
! 		 * backends could have added tuples to these pages whilst we were
! 		 * vacuuming.
! 		 */
! 		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
! 
! 		if (new_rel_pages >= old_rel_pages)
! 		{
! 			/* can't do anything after all */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
! 
! 		/*
! 		 * Okay to truncate.
! 		 */
! 		RelationTruncate(onerel, new_rel_pages);
! 
! 		/*
! 		 * We can release the exclusive lock as soon as we have truncated.
! 		 * Other backends can't safely access the relation until they have 
! 		 * processed the smgr invalidation that smgrtruncate sent out ... 
! 		 * but that should happen as part of standard invalidation 
! 		 * processing once they acquire lock on the relation.
! 		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 
! 		/*
! 		 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 		 * also touching reltuples, since the tuple count wasn't changed by the
! 		 * truncation.
! 		 */
! 		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
! 		vacrelstats->rel_pages = new_rel_pages;
! 
! 		ereport(elevel,
! 				(errmsg("\"%s\": truncated %u to %u pages",
! 						RelationGetRelationName(onerel),
! 						old_rel_pages, new_rel_pages),
! 				 errdetail("%s.",
! 						   pg_rusage_show(&ru0))));
! 		old_rel_pages = new_rel_pages;
! 	} while (new_rel_pages > vacrelstats->nonempty_pages && 
! 			vacrelstats->lock_waiter_detected);
  }
  
  /*
*************** static BlockNumber
*** 1340,1345 ****
--- 1417,1428 ----
  count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  {
  	BlockNumber blkno;
+ 	instr_time	starttime;
+ 	instr_time	currenttime;
+ 	instr_time	elapsed;
+ 
+ 	/* Initialize the starttime if we check for conflicting lock requests */
+ 	INSTR_TIME_SET_CURRENT(starttime);
  
  	/* Strange coding of loop control is needed because blkno is unsigned */
  	blkno = vacrelstats->rel_pages;
*************** count_nondeletable_pages(Relation onerel
*** 1352,1357 ****
--- 1435,1471 ----
  		bool		hastup;
  
  		/*
+ 		 * Check if another process requests a lock on our relation.
+ 		 * We are holding an AccessExclusiveLock here, so they will
+ 		 * be waiting. We only do this in autovacuum_truncate_lock_check
+ 		 * millisecond intervals, and we only check if that interval
+ 		 * has elapsed once every 32 blocks to keep the number of
+ 		 * system calls and actual shared lock table lookups to a
+ 		 * minimum.
+ 		 */
+ 		if ((blkno % 32) == 0)
+ 		{
+ 			INSTR_TIME_SET_CURRENT(currenttime);
+ 			elapsed = currenttime;
+ 			INSTR_TIME_SUBTRACT(elapsed, starttime);
+ 			if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000) 
+ 					>= autovacuum_truncate_lock_check)
+ 			{
+ 				if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
+ 				{
+ 					ereport(elevel,
+ 							(errmsg("\"%s\": terminating truncate "
+ 									"due to conflicting lock request",
+ 									RelationGetRelationName(onerel))));
+ 
+ 					vacrelstats->lock_waiter_detected = true;
+ 					return blkno;
+ 				}
+ 				starttime = currenttime;
+ 			}
+ 		}
+ 
+ 		/*
  		 * We don't insert a vacuum delay point here, because we have an
  		 * exclusive lock on the table which we want to hold for as short a
  		 * time as possible.  We still need to check for interrupts however.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6977bcf..547eab6 100644
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int			autovacuum_freeze_max_age;
*** 118,123 ****
--- 118,124 ----
  
  int			autovacuum_vac_cost_delay;
  int			autovacuum_vac_cost_limit;
+ int			autovacuum_truncate_lock_try;
  
  int			Log_autovacuum_min_duration = -1;
  
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index a7786d0..e1fa74f 100644
*** a/src/backend/storage/lmgr/lmgr.c
--- b/src/backend/storage/lmgr/lmgr.c
*************** UnlockRelation(Relation relation, LOCKMO
*** 233,238 ****
--- 233,256 ----
  }
  
  /*
+  *		LockHasWaitersRelation
+  *
+  * This is a function to check if someone else is waiting on a
+  * lock we are currently holding.
+  */
+ bool
+ LockHasWaitersRelation(Relation relation, LOCKMODE lockmode)
+ {
+ 	LOCKTAG		tag;
+ 
+ 	SET_LOCKTAG_RELATION(tag,
+ 						 relation->rd_lockInfo.lockRelId.dbId,
+ 						 relation->rd_lockInfo.lockRelId.relId);
+ 
+ 	return LockHasWaiters(&tag, lockmode, false);
+ }
+ 
+ /*
   *		LockRelationIdForSession
   *
   * This routine grabs a session-level lock on the target relation.	The
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 32cc229..b2a818c 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** ProcLockHashCode(const PROCLOCKTAG *proc
*** 539,544 ****
--- 539,636 ----
  	return lockhash;
  }
  
+ /*
+  * LockHasWaiters -- look up 'locktag' and check if releasing this
+  *		lock would wake up other processes waiting for it.
+  */
+ bool
+ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
+ {
+ 	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+ 	LockMethod	lockMethodTable;
+ 	LOCALLOCKTAG localtag;
+ 	LOCALLOCK  *locallock;
+ 	LOCK	   *lock;
+ 	PROCLOCK   *proclock;
+ 	LWLockId	partitionLock;
+ 	bool		hasWaiters = FALSE;
+ 
+ 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+ 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+ 	lockMethodTable = LockMethods[lockmethodid];
+ 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
+ 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
+ 
+ #ifdef LOCK_DEBUG
+ 	if (LOCK_DEBUG_ENABLED(locktag))
+ 		elog(LOG, "LockHasWaiters: lock [%u,%u] %s",
+ 			 locktag->locktag_field1, locktag->locktag_field2,
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ #endif
+ 
+ 	/*
+ 	 * Find the LOCALLOCK entry for this lock and lockmode
+ 	 */
+ 	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+ 	localtag.lock = *locktag;
+ 	localtag.mode = lockmode;
+ 
+ 	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+ 										  (void *) &localtag,
+ 										  HASH_FIND, NULL);
+ 
+ 	/*
+ 	 * let the caller print its own error message, too. Do not ereport(ERROR).
+ 	 */
+ 	if (!locallock || locallock->nLocks <= 0)
+ 	{
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		return FALSE;
+ 	}
+ 
+ 	/*
+ 	 * Check the shared lock table.
+ 	 */
+ 	partitionLock = LockHashPartitionLock(locallock->hashcode);
+ 
+ 	LWLockAcquire(partitionLock, LW_SHARED);
+ 
+ 	/*
+ 	 * We don't need to re-find the lock or proclock, since we kept their
+ 	 * addresses in the locallock table, and they couldn't have been removed
+ 	 * while we were holding a lock on them.
+ 	 */
+ 	lock = locallock->lock;
+ 	LOCK_PRINT("LockHasWaiters: found", lock, lockmode);
+ 	proclock = locallock->proclock;
+ 	PROCLOCK_PRINT("LockHasWaiters: found", proclock);
+ 
+ 	/*
+ 	 * Double-check that we are actually holding a lock of the type we want to
+ 	 * release.
+ 	 */
+ 	if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
+ 	{
+ 		PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
+ 		LWLockRelease(partitionLock);
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		RemoveLocalLock(locallock);
+ 		return FALSE;
+ 	}
+ 
+ 	/*
+ 	 * Do the checking.
+ 	 */
+ 	if ((lockMethodTable->conflictTab[lockmode] & lock->waitMask) != 0)
+ 		hasWaiters = TRUE;
+ 
+ 	LWLockRelease(partitionLock);
+ 
+ 	return hasWaiters;
+ }
+ 
  
  /*
   * LockAcquire -- Check for lock conflicts, sleep if conflict found,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 745e7be..8937ba5 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_int ConfigureNamesI
*** 1815,1820 ****
--- 1815,1831 ----
  	},
  
  	{
+ 		{"autovacuum_truncate_lock_try", PGC_SIGHUP, AUTOVACUUM,
+ 			gettext_noop("How long autovacuum tries to acquire an exclusive lock for truncate."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&autovacuum_truncate_lock_try,
+ 		2000, 0, 20000,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"max_files_per_process", PGC_POSTMASTER, RESOURCES_KERNEL,
  			gettext_noop("Sets the maximum number of simultaneously open files for each server process."),
  			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index eeb9b82..f14ea73 100644
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 471,476 ****
--- 471,478 ----
  #autovacuum_vacuum_cost_limit = -1	# default vacuum cost limit for
  					# autovacuum, -1 means use
  					# vacuum_cost_limit
+ #autovacuum_truncate_lock_try = 2000ms	# How long autovacuum tries to
+ 					# acquire an exclusive lock for truncate.
  
  
  #------------------------------------------------------------------------------
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index a851758..b17086c 100644
*** a/src/include/postmaster/autovacuum.h
--- b/src/include/postmaster/autovacuum.h
*************** extern double autovacuum_anl_scale;
*** 26,31 ****
--- 26,32 ----
  extern int	autovacuum_freeze_max_age;
  extern int	autovacuum_vac_cost_delay;
  extern int	autovacuum_vac_cost_limit;
+ extern int	autovacuum_truncate_lock_try;
  
  /* autovacuum launcher PID, only valid when worker is shutting down */
  extern int	AutovacuumLauncherPid;
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index de340c4..aa79eda 100644
*** a/src/include/storage/lmgr.h
--- b/src/include/storage/lmgr.h
*************** extern void UnlockRelationOid(Oid relid,
*** 31,36 ****
--- 31,37 ----
  extern void LockRelation(Relation relation, LOCKMODE lockmode);
  extern bool ConditionalLockRelation(Relation relation, LOCKMODE lockmode);
  extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
  
  extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
  extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index d56f0fa..f0eca35 100644
*** a/src/include/storage/lock.h
--- b/src/include/storage/lock.h
*************** extern void LockReleaseAll(LOCKMETHODID 
*** 494,499 ****
--- 494,501 ----
  extern void LockReleaseSession(LOCKMETHODID lockmethodid);
  extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
  extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+ extern bool LockHasWaiters(const LOCKTAG *locktag,
+ 			LOCKMODE lockmode, bool sessionLock);
  extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
  				 LOCKMODE lockmode);
  extern void AtPrepare_Locks(void);
#27Kevin Grittner
kgrittn@mail.com
In reply to: Jan Wieck (#26)
Re: autovacuum truncate exclusive lock round two

Jan Wieck wrote:

Attached is a new patch that addresses most of the points raised
in discussion before.

1) Most of the configuration variables are derived from
deadlock_timeout now. The "check for conflicting lock request"
interval is deadlock_timeout/10, clamped to 10ms. The "try to
acquire exclusive lock" interval is deadlock_timeout/20, also
clamped to 10ms. The only GUC variable remaining is
autovacuum_truncate_lock_try=2000ms with a range from 0 (just try
once) to 20000ms.

If we're going to keep this GUC, we need docs for it.

I'd like to point out that this is a significant change in
functionality as without the config option for the check
interval, there is no longer any possibility to disable the call
to LockHasWaiters() and return to the original (deadlock code
kills autovacuum) behavior.

Arguably we could simplify the deadlock resolution code a little,
but it seems like it is probably safer to leave it as a failsafe,
at least for now.

I did not touch the part about suppressing the stats and the
ANALYZE step of "auto vacuum+analyze". The situation is no
different today. When the deadlock code kills autovacuum, stats
aren't updated either. And this patch is meant to cause
autovacuum to finish the truncate in a few minutes or hours, so
that the situation fixes itself.

Unless someone want to argue for more aggressive updating of the
stats, I guess I'll yield to that argument.

Besides the documentation, the only thing I could spot on this
go-around was that this comment probably no longer has a reason for
being since this is no longer conditional and what it's doing is
obvious from the code:

   /* Initialize the starttime if we check for conflicting lock requests */

With docs and dropping the obsolete comment, I think this will be
ready.

-Kevin


#28Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#27)
Re: autovacuum truncate exclusive lock round two

On 12/3/2012 5:42 PM, Kevin Grittner wrote:

Jan Wieck wrote:

Attached is a new patch that addresses most of the points raised
in discussion before.

1) Most of the configuration variables are derived from
deadlock_timeout now. The "check for conflicting lock request"
interval is deadlock_timeout/10, clamped to 10ms. The "try to
acquire exclusive lock" interval is deadlock_timeout/20, also
clamped to 10ms. The only GUC variable remaining is
autovacuum_truncate_lock_try=2000ms with a range from 0 (just try
once) to 20000ms.

If we're going to keep this GUC, we need docs for it.

Certainly. But since we're still debating which and how many GUC
variables we want, I don't think doc-time has come yet.

I'd like to point out that this is a significant change in
functionality as without the config option for the check
interval, there is no longer any possibility to disable the call
to LockHasWaiters() and return to the original (deadlock code
kills autovacuum) behavior.

Arguably we could simplify the deadlock resolution code a little,
but it seems like it is probably safer to leave it as a failsafe,
at least for now.

Thinking about it, I'm not really happy with removing the
autovacuum_truncate_lock_check GUC at all.

Fact is that the deadlock detection code and the configuration parameter
for it should IMHO have nothing to do with all this in the first place.
A properly implemented application does not deadlock. Someone running
such a properly implemented application should be able to safely set
deadlock_timeout to minutes without the slightest ill side effect, but
with the benefit that the deadlock detection code itself does not add to
the lock contention. The only reason one cannot do so today is because
autovacuum's truncate phase could then freeze the application with an
exclusive lock for that long.

I believe the check interval needs to be decoupled from the
deadlock_timeout again. This will leave us with 2 GUCs at least.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#29Kevin Grittner
kgrittn@mail.com
In reply to: Jan Wieck (#28)
Re: autovacuum truncate exclusive lock round two

Jan Wieck wrote:

Thinking about it, I'm not really happy with removing the
autovacuum_truncate_lock_check GUC at all.

Fact is that the deadlock detection code and the configuration
parameter for it should IMHO have nothing to do with all this in
the first place. A properly implemented application does not
deadlock.

I don't agree. I believe that in some cases it is possible and
practicable to set access rules which would prevent deadlocks in
application access to a database. In other cases the convolutions
required in the code, the effort in educating dozens or hundreds of
programmers maintaining the code (and keeping the training current
during staff turnover), and the staff time required for compliance
far outweigh the benefit of an occasional transaction retry.
However, it is enough for your argument that there are cases where
it can be done.

Someone running such a properly implemented application should be
able to safely set deadlock_timeout to minutes without the
slightest ill side effect, but with the benefit that the deadlock
detection code itself does not add to the lock contention. The
only reason one cannot do so today is because autovacuum's
truncate phase could then freeze the application with an
exclusive lock for that long.

I believe the check interval needs to be decoupled from the
deadlock_timeout again.

OK

This will leave us with 2 GUCs at least.

Hmm. What problems do you see with hard-coding reasonable values?
Adding two or three GUC settings for a patch with so little
user-visible impact seems weird. And it seems to me (and also
seemed to Robert) as though the specific values of the other two
settings really aren't that critical as long as they are anywhere
within a reasonable range. Configuring PostgreSQL can be
intimidating enough without adding knobs that really don't do
anything useful. Can you show a case where special values would be
helpful?

-Kevin


#30Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#29)
Re: autovacuum truncate exclusive lock round two

On 12/4/2012 8:06 AM, Kevin Grittner wrote:

Jan Wieck wrote:

I believe the check interval needs to be decoupled from the
deadlock_timeout again.

OK

This will leave us with 2 GUCs at least.

Hmm. What problems do you see with hard-coding reasonable values?

The question is what is reasonable?

Let's talk about the time to (re)acquire the lock first. In the cases
where truncating a table can hurt, we are dealing with many gigabytes.
The original vacuumlazy scan of them can take hours if not days. During
that scan the vacuum worker has probably spent many hours napping in the
vacuum delay points. For me, retrying at a 50ms interval for 5 seconds
would be reasonable for (re)acquiring that lock.

The reasoning is that we need some sort of retry mechanism: if
autovacuum just gave up the exclusive lock because someone needed
access, it is more or less guaranteed that an immediate attempt to
reacquire it will fail until that waiter has committed. But if it
can't get the lock after 5 seconds, the system is busy enough that
autovacuum should come back much later, when the launcher kicks it off
again.

I don't care much about occupying that autovacuum worker for a few
seconds. It just spent hours vacuuming that very table. How much harm
will a couple more seconds do?

The check interval for the LockHasWaiters() call, however, depends very
much on the response time constraints of the application. A 200ms
interval, for example, would cause the truncate phase to hold onto the
exclusive lock for at least 200ms at a time. That means a steady stream
of short running transactions would see 100ms of "blocking" on average,
200ms max. For many applications that is probably OK. But if your
response time constraint is <=50ms on 98% of transactions, you might
want to have that knob.

I admit I really have no idea what the most reasonable default for that
value would be. Something between 50ms and deadlock_timeout/2 I guess.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#31Kevin Grittner
kgrittn@mail.com
In reply to: Jan Wieck (#30)
Re: autovacuum truncate exclusive lock round two

Jan Wieck wrote:

[arguments for GUCs]

This is getting confusing. I thought I had already conceded the
case for autovacuum_truncate_lock_try, and you appeared to spend
most of your post arguing for it anyway. I think. It's a little
hard to tell. Perhaps the best thing is to present the issue to the
list and solicit more opinions on what to do. Please correct me if
I mis-state any of this.

The primary problem this patch is solving is that in some
workloads, autovacuum will repeatedly try to truncate the unused
pages at the end of a table, but will continually get canceled
after burning resources because another process wants to acquire a
lock on the table which conflicts with the one held by autovacuum.
This is handled by the deadlock checker, so another process must
block for the deadlock_timeout interval each time. All work done by
the truncate phase of autovacuum is lost on each interrupted
attempt. Statistical information is not updated, so another attempt
will trigger the next time autovacuum looks at whether to vacuum
the table.

It's obvious that this pattern not only fails to release
potentially large amounts of unused space back to the OS, but the
headbanging can continue to consume significant resources and for
an extended period, and the repeated blocking for deadlock_timeout
could cause latency problems.

The patch has the truncate work, which requires
AccessExclusiveLock, check at intervals for whether another process
is waiting on its lock. That interval is one of the timings we need
to determine, and for which a GUC was initially proposed. I think
that the check should be fast enough that doing it once every 20ms
as a hard-coded interval would be good enough. When it sees this
situation, it truncates the file for as far as it has managed to
get, releases its lock on the table, sleeps for an interval, and
then checks to see if the lock has become available again.

How long it should sleep between tries to reacquire the lock is
another possible GUC. Again, I'm inclined to think that this could
be hard-coded. Since autovacuum was knocked off-task after doing
some significant work, I'm inclined to make this interval a little
bigger, but I don't think it matters a whole lot. Anything between
20ms and 100ms seems sane. Maybe 50ms?

At any point that it is unable to acquire the lock, there is a
check for how long this autovacuum task has been starved for the
lock. Initially I was arguing for twice the deadlock_timeout on the
basis that this would probably be short enough not to leave the
autovacuum worker sidelined for too long, but long enough for the
attempt to get past a single deadlock between two other processes.
This is the setting Jan is least willing to concede.

If the autovacuum worker does abandon the attempt, it will keep
retrying, since we go out of our way to prevent the autovacuum
process from updating the statistics based on the "incomplete"
processing. This last interval is not how long it will attempt to
truncate, but how long it will keep one autovacuum worker making
unsuccessful attempts to acquire the lock before it is put to other
uses. Workers will keep coming back to this table until the
truncate phase is completed, just as it does without the patch; the
difference being that anytime it gets the lock, even briefly, it is
able to persist some progress.
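
Roughly, in code form, the flow just described (condensed from the
patch under review; the relation-size re-checks and interrupt handling
are omitted):

    do
    {
        /* (re)acquire the exclusive lock, napping between attempts */
        while (!ConditionalLockRelation(onerel, AccessExclusiveLock))
        {
            if (++lock_retry > autovacuum_truncate_lock_retry)
                return;                 /* give up for now; autovacuum comes back later */
            pg_usleep(autovacuum_truncate_lock_wait * 1000L);
        }

        /* backward scan; bails out early when a waiter shows up on our lock */
        new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);

        /* persist whatever progress was verified, then let the waiter in */
        RelationTruncate(onerel, new_rel_pages);
        UnlockRelation(onerel, AccessExclusiveLock);
    } while (vacrelstats->lock_waiter_detected &&
             new_rel_pages > vacrelstats->nonempty_pages);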

So the question on the table is which of these three intervals
should be GUCs, and what values to use if they aren't.

-Kevin


#32Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#31)
Re: autovacuum truncate exclusive lock round two

On 12/4/2012 1:51 PM, Kevin Grittner wrote:

Jan Wieck wrote:

[arguments for GUCs]

This is getting confusing. I thought I had already conceded the
case for autovacuum_truncate_lock_try, and you appeared to spend
most of your post arguing for it anyway. I think. It's a little
hard to tell. Perhaps the best thing is to present the issue to the
list and solicit more opinions on what to do. Please correct me if
I mis-state any of this.

The primary problem this patch is solving is that in some
workloads, autovacuum will repeatedly try to truncate the unused
pages at the end of a table, but will continually get canceled
after burning resources because another process wants to acquire a
lock on the table which conflicts with the one held by autovacuum.
This is handled by the deadlock checker, so another process must
block for the deadlock_timeout interval each time. All work done by
the truncate phase of autovacuum is lost on each interrupted
attempt. Statistical information is not updated, so another attempt
will trigger the next time autovacuum looks at whether to vacuum
the table.

It's obvious that this pattern not only fails to release
potentially large amounts of unused space back to the OS, but the
headbanging can continue to consume significant resources and for
an extended period, and the repeated blocking for deadlock_timeout
could cause latency problems.

The patch has the truncate work, which requires
AccessExclusiveLock, check at intervals for whether another process
is waiting on its lock. That interval is one of the timings we need
to determine, and for which a GUC was initially proposed. I think
that the check should be fast enough that doing it once every 20ms
as a hard-coded interval would be good enough. When it sees this
situation, it truncates the file for as far as it has managed to
get, releases its lock on the table, sleeps for an interval, and
then checks to see if the lock has become available again.

How long it should sleep between tries to reacquire the lock is
another possible GUC. Again, I'm inclined to think that this could
be hard-coded. Since autovacuum was knocked off-task after doing
some significant work, I'm inclined to make this interval a little
bigger, but I don't think it matters a whole lot. Anything between
20ms and 100ms seems sane. Maybe 50ms?

At any point that it is unable to acquire the lock, there is a
check for how long this autovacuum task has been starved for the
lock. Initially I was arguing for twice the deadlock_timeout on the
basis that this would probably be short enough not to leave the
autovacuum worker sidelined for too long, but long enough for the
attempt to get past a single deadlock between two other processes.
This is the setting Jan is least willing to concede.

If the autovacuum worker does abandon the attempt, it will keep
retrying, since we go out of our way to prevent the autovacuum
process from updating the statistics based on the "incomplete"
processing. This last interval is not how long it will attempt to
truncate, but how long it will keep one autovacuum worker making
unsuccessful attempts to acquire the lock before it is put to other
uses. Workers will keep coming back to this table until the
truncate phase is completed, just as it does without the patch; the
difference being that anytime it gets the lock, even briefly, it is
able to persist some progress.

That is all correct.

So the question on the table is which of these three intervals
should be GUCs, and what values to use if they aren't.

I could live with all the above defaults, but would like to see more
comments on them.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#33Robert Haas
robertmhaas@gmail.com
In reply to: Jan Wieck (#32)
Re: autovacuum truncate exclusive lock round two

On Tue, Dec 4, 2012 at 2:05 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

So the question on the table is which of these three intervals
should be GUCs, and what values to use if they aren't.

I could live with all the above defaults, but would like to see more
comments on them.

I largely agree with what's already been said. The only interval that
seems to me to maybe need its own knob is the total time after which
the autovacuum worker will give up. If we set it to 2 *
deadlock_timeout, some people might find that a reason to raise
deadlock_timeout. Since people *already* raise deadlock_timeout to
obscenely high values (a minute? an hour???) and then complain that
things blow up in their face, I think there's a decent argument to be
made that piggybacking anything else on that setting is unwise.

Against that, FWICT, this problem only affects a small number of
users: Jan is the only person I can ever remember reporting this
issue. I'm not dumb enough to think he's the only person who it
affects; but my current belief is that it's not an enormously common
problem. So the main argument I can see against adding a GUC is that
the problem is too marginal to justify a setting of its own. What I
really see as the key issue is: suppose we hardcode this to say 2
seconds. Is that going to fix the problem effectively for 99% of the
people who have this problem, or for 25% of the people who have this
problem? In the former case, we probably don't need a GUC; in the
latter case, we probably do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#34Kevin Grittner
kgrittn@mail.com
In reply to: Robert Haas (#33)
Re: autovacuum truncate exclusive lock round two

Robert Haas wrote:

Since people *already* raise deadlock_timeout to obscenely high
values (a minute? an hour???) and then complain that things blow
up in their face, I think there's a decent argument to be made
that piggybacking anything else on that setting is unwise.

If people are really doing that, then I tend to agree. I wasn't
aware of that practice.

Against that, FWICT, this problem only affects a small number of
users: Jan is the only person I can ever remember reporting this
issue. I'm not dumb enough to think he's the only person who it
affects; but my current belief is that it's not an enormously
common problem. So the main argument I can see against adding a
GUC is that the problem is too marginal to justify a setting of
its own. What I really see as the key issue is: suppose we
hardcode this to say 2 seconds. Is that going to fix the problem
effectively for 99% of the people who have this problem, or for
25% of the people who have this problem? In the former case, we
probably don't need a GUC; in the latter case, we probably do.

Given the fact that autovacuum will keep throwing workers at it to
essentially loop indefinitely at an outer level, I don't think the
exact setting of this interval is all that critical either. My gut
feel is that anything in the 2 second to 5 second range would be
sane, so I won't argue over any explicit setting within that range.
Below that, I think the overhead of autovacuum coming back to the
table repeatedly would probably start to get too high; above that
we could be causing some small, heavily-updated table to be
neglected by autovacuum -- especially if you get multiple
autovacuum workers tied up in this delay on different tables at the
same time.

Regarding how many people are affected, I have seen several reports
of situations where users claim massive impact on performance when
autovacuum kicks in. The reports have not included enough detail to
quantify the impact or in most cases to establish a cause, but this
seems like it could have a noticeable impact, especially if the
deadlock timeout was set to more than a second.

-Kevin


#35Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#34)
Re: autovacuum truncate exclusive lock round two

On Wed, Dec 5, 2012 at 11:24 AM, Kevin Grittner <kgrittn@mail.com> wrote:

Robert Haas wrote:

Since people *already* raise deadlock_timeout to obscenely high
values (a minute? an hour???) and then complain that things blow
up in their face, I think there's a decent argument to be made
that piggybacking anything else on that setting is unwise.

If people are really doing that, then I tend to agree. I wasn't
aware of that practice.

It's probably not quite common enough to be called a "practice", but I
have encountered it a number of times in support situations. Alas, I
no longer remember the details of exactly what misery it caused, but I
do remember it wasn't good. :-)

Against that, FWICT, this problem only affects a small number of
users: Jan is the only person I can ever remember reporting this
issue. I'm not dumb enough to think he's the only person who it
affects; but my current belief is that it's not an enormously
common problem. So the main argument I can see against adding a
GUC is that the problem is too marginal to justify a setting of
its own. What I really see as the key issue is: suppose we
hardcode this to say 2 seconds. Is that going to fix the problem
effectively for 99% of the people who have this problem, or for
25% of the people who have this problem? In the former case, we
probably don't need a GUC; in the latter case, we probably do.

Given the fact that autovacuum will keep throwing workers at it to
essentially loop indefinitely at an outer level, I don't think the
exact setting of this interval is all that critical either. My gut
feel is that anything in the 2 second to 5 second range would be
sane, so I won't argue over any explicit setting within that range.
Below that, I think the overhead of autovacuum coming back to the
table repeatedly would probably start to get too high; above that
we could be causing some small, heavily-updated table to be
neglected by autovacuum -- especially if you get multiple
autovacuum workers tied up in this delay on different tables at the
same time.

I think that part of what's tricky here is that the dynamics of this
problem depend heavily on table size. I handled one support case
where lowering autovacuum_naptime to 15s was an indispensable part of
the solution, so in that case having an autovacuum worker retry for
more than a few seconds sounds kind of insane. OTOH, that case
involved a small, rapidly changing table. If you've got an enormous
table where vacuum takes an hour to chug through all of it, abandoning
the effort to truncate the table after a handful of seconds might
sound equally insane.

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
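
To make that concrete, here is a minimal sketch of the heuristic. The
function and variable names are made up for illustration; this isn't
taken from any patch in the thread.

static int
truncate_lock_retry_budget_ms(long vacuum_elapsed_ms)
{
    /* spend at most 10% of the time already spent vacuuming */
    long    budget_ms = vacuum_elapsed_ms / 10;

    if (budget_ms < 1000)
        budget_ms = 1000;       /* floor: 1 second */
    if (budget_ms > 60000)
        budget_ms = 60000;      /* ceiling: 1 minute */

    return (int) budget_ms;
}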

Regarding how many people are affected, I have seen several reports
of situations where users claim massive impact on performance when
autovacuum kicks in. The reports have not included enough detail to
quantify the impact or in most cases to establish a cause, but this
seems like it could have a noticeable impact, especially if the
deadlock timeout was set to more than a second.

Yeah, I agree this could be a cause of those types of reports, but I
don't have any concrete evidence that any of the cases I've worked
were actually due to this specific issue. The most recent case of
this type I worked on was due to I/O saturation - which, since it
happened to be EC2, really meant network saturation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#36Jan Wieck
JanWieck@Yahoo.com
In reply to: Robert Haas (#35)
Re: autovacuum truncate exclusive lock round two

On 12/5/2012 2:00 PM, Robert Haas wrote:

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.

That sort of "dynamic" approach would indeed be interesting. But I fear
that it is going to be complex at best. The amount of time spent in
scanning heavily depends on the visibility map. The initial vacuum scan
of a table can take hours or more, but it does update the visibility map
even if the vacuum itself is aborted later. The next vacuum may scan
that table in almost no time at all, because it skips all blocks that
are marked "all visible".

So the total time the "scan" takes is no yardstick I'd use.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#37Robert Haas
robertmhaas@gmail.com
In reply to: Jan Wieck (#36)
Re: autovacuum truncate exclusive lock round two

On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

On 12/5/2012 2:00 PM, Robert Haas wrote:

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.

That sort of "dynamic" approach would indeed be interesting. But I fear that
it is going to be complex at best. The amount of time spent in scanning
heavily depends on the visibility map. The initial vacuum scan of a table
can take hours or more, but it does update the visibility map even if the
vacuum itself is aborted later. The next vacuum may scan that table in
almost no time at all, because it skips all blocks that are marked "all
visible".

Well, if that's true, then there's little reason to worry about giving
up quickly, because the next autovacuum a minute later won't consume
many resources.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#38Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#36)
Re: autovacuum truncate exclusive lock round two

Kevin and Robert are well aware of most of the below. I just want to put
this out here so other people, who haven't followed the discussion too
closely, may chime in.

Some details on the problem:

First of all, the vacuum scan must detect a minimum of 1000 pages that
are possibly all empty at the end of a relation. Without at least 8MB of
potential free space at the end, the code never calls
lazy_truncate_heap(). This means we don't have to worry about tiny
relations at all. Any relation that stays under 8MB of turnover between
autovacuum VACUUM runs can never get into this situation.
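
For reference, the gate that decides whether to attempt the truncate at
all looks roughly like this in lazy_vacuum_rel() (paraphrased from
memory, so the exact identifiers may differ; 1000 pages at the default
8K block size is the 8MB mentioned above):

/* Don't bother unless a worthwhile number of pages could be released */
possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
if (possibly_freeable > 0 &&
    (possibly_freeable >= REL_TRUNCATE_MINIMUM ||
     possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION))
    lazy_truncate_heap(onerel, vacrelstats);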

Relations that have higher turnover than that, but at random places or
with a high percentage of rather static rows, don't fall into the
problem category either. They may never accumulate that much "contiguous
free space at the end". The turnover will be reusing free space all over
the place. So again, lazy_truncate_heap() will never be called.

Relations that eventually build up more than 8MB of free space at the
end aren't automatically a problem. The autovacuum VACUUM scan just
scanned those pages at the end, which means that the safety scan for
truncate, done under exclusive lock, is checking exactly those pages at
the end and most likely they are still in memory. The truncate safety
scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case I have found so far is rolling window
tables of significant size that can bloat to multiple times their normal
size every now and then. This is indeed a rare corner case, and I have
no idea how many users may (unknowingly) be suffering from it.

This rare corner case triggers lazy_truncate_heap() with a significant
amount of free space to truncate. The table bloats, then all the bloat
is deleted, and the periodic 100% turnover guarantees that all "live"
tuples will soon circulate in lower block numbers again, with gigabytes
of empty space at the end.

This by itself still isn't a problem. The existing code may do the job
just fine "unless" there is "frequent" access to that very table. Only
with this special combination of circumstances do we actually have a
problem.

Only now, with a significant amount of free space at the end and
frequent access to the table, does the truncate safety scan take long
enough, and actually have to read pages from disk, to interfere with
client transactions.

At this point, the truncate safety scan may have to be interrupted to
let the other, frequent traffic go through. This is what we accomplish
with the autovacuum_truncate_lock_check interval, where we voluntarily
release the lock whenever someone else needs it. I agree with Kevin that
a 20ms check interval is reasonable because the code to check this is
even less expensive than releasing the exclusive lock we're holding.
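
Condensed, the check that the patch adds to count_nondeletable_pages()
amounts to the following; elapsed_ms stands in for the instr_time
bookkeeping in the actual code:

/*
 * Every 32 blocks of the backward scan, and only if at least the check
 * interval has passed since the last check, ask the lock manager whether
 * anyone is queued behind our AccessExclusiveLock.  If so, stop here so
 * the caller can truncate what has been verified so far and release the
 * lock.
 */
if ((blkno % 32) == 0 &&
    elapsed_ms >= autovacuum_truncate_lock_check &&
    LockHasWaitersRelation(onerel, AccessExclusiveLock))
{
    vacrelstats->lock_waiter_detected = true;
    return blkno;
}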

At the same time, completely giving up and relying on the autovacuum
launcher to restart another worker isn't as free as it looks either.
The next autovacuum worker will have to do the VACUUM scan first,
before getting to the truncate phase. We cannot just skip blindly to
the truncate code. If the truncate is repeatedly abandoned, the table
would deteriorate and accumulate dead tuples again. The removal of dead
tuples and their index tuples has priority.

As said earlier in the discussion, the VACUUM scan will skip pages that
are marked as completely visible. So the scan won't physically read the
majority of the empty pages at the end of the table over and over. But
it will at least scan all pages that have been modified since the last
VACUUM run.

To me this means that we want to be more generous to the truncate code
about acquiring the exclusive lock. In my tests, I've seen that a
rolling window table with a "live" set of just 10 MB or so, but empty
space of 3 GB, can still have a 2-minute VACUUM scan time. Throwing that
work away because we can't acquire the exclusive lock within 2 seconds
is a waste of effort.

Something between 2 and 60 seconds sounds more reasonable to me.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


#39Jan Wieck
JanWieck@Yahoo.com
In reply to: Robert Haas (#37)
1 attachment(s)
Re: autovacuum truncate exclusive lock round two

On 12/6/2012 12:45 PM, Robert Haas wrote:

On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

That sort of "dynamic" approach would indeed be interesting. But I fear that
it is going to be complex at best. The amount of time spent in scanning
heavily depends on the visibility map. The initial vacuum scan of a table
can take hours or more, but it does update the visibility map even if the
vacuum itself is aborted later. The next vacuum may scan that table in
almost no time at all, because it skips all blocks that are marked "all
visible".

Well, if that's true, then there's little reason to worry about giving
up quickly, because the next autovacuum a minute later won't consume
many resources.

"Almost no time" is of course "relative" to what an actual scan and dead
tuple removal cost. Looking at a table with 3 GB of dead tuples at the
end, the initial vacuum scan takes hours. When vacuum comes back to this
table, cleaning up a couple megabytes of newly deceased tuples and then
skipping over the all-visible pages may take a minute.

Based on the discussion and what I feel is a consensus I have created an
updated patch that has no GUC at all. The hard coded parameters in
include/postmaster/autovacuum.h are

AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
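
Spelled out, those values combine as follows (this only restates what
the patch below does):

/*
 * The exclusive lock is retried every WAIT_INTERVAL ms and the truncate
 * is abandoned once the retries add up to TIMEOUT ms; while the lock is
 * held, the waiter check fires at most once every CHECK_INTERVAL ms.
 */
int     max_lock_retries = AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT /
                           AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL;  /* = 100 */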

I gave that the worst workload I can think of. A pgbench-style
application throws about 10 transactions per second at it, so that the
lock constantly has to be given up due to conflicting lock requests and
then reacquired. A "cleanup" process periodically moves old tuples from
the history table to an archive table, making history a rolling window
table. And a third job, 2-3 times per minute, produces a transaction
lasting 10 seconds, forcing autovacuum to give up on reacquiring the
lock.

Even with that workload, autovacuum is slowly but steadily chopping away
at the table.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachments:

autovacuum-truncate-lock-3.difftext/x-patch; name=autovacuum-truncate-lock-3.diffDownload
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c9253a9..fd3e91e 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 52,62 ****
--- 52,64 ----
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/lmgr.h"
+ #include "storage/proc.h"
  #include "utils/lsyscache.h"
  #include "utils/memutils.h"
  #include "utils/pg_rusage.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "portability/instr_time.h"
  
  
  /*
*************** typedef struct LVRelStats
*** 103,108 ****
--- 105,111 ----
  	ItemPointer dead_tuples;	/* array of ItemPointerData */
  	int			num_index_scans;
  	TransactionId latestRemovedXid;
+ 	bool		lock_waiter_detected;
  } LVRelStats;
  
  
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 193,198 ****
--- 196,203 ----
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
  	vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples;
  	vacrelstats->num_index_scans = 0;
+ 	vacrelstats->pages_removed = 0;
+ 	vacrelstats->lock_waiter_detected = false;
  
  	/* Open all indexes of the relation */
  	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 259,268 ****
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/* report results to the stats collector, too */
! 	pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
--- 264,281 ----
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/*
! 	 * report results to the stats collector, too.
! 	 * An early terminated lazy_truncate_heap attempt 
! 	 * suppresses the message and also cancels the
! 	 * execution of ANALYZE, if that was ordered.
! 	 */
! 	if (!vacrelstats->lock_waiter_detected)
! 		pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
+ 	else
+ 		vacstmt->options &= ~VACOPT_ANALYZE;
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
*************** lazy_truncate_heap(Relation onerel, LVRe
*** 1255,1334 ****
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * We need full exclusive lock on the relation in order to do truncation.
! 	 * If we can't get it, give up rather than waiting --- we don't want to
! 	 * block other backends, and we don't want to deadlock (which is quite
! 	 * possible considering we already hold a lower-grade lock).
! 	 */
! 	if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
! 		return;
! 
! 	/*
! 	 * Now that we have exclusive lock, look to see if the rel has grown
! 	 * whilst we were vacuuming with non-exclusive lock.  If so, give up; the
! 	 * newly added pages presumably contain non-deletable tuples.
  	 */
! 	new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 	if (new_rel_pages != old_rel_pages)
  	{
  		/*
! 		 * Note: we intentionally don't update vacrelstats->rel_pages with the
! 		 * new rel size here.  If we did, it would amount to assuming that the
! 		 * new pages are empty, which is unlikely.	Leaving the numbers alone
! 		 * amounts to assuming that the new pages have the same tuple density
! 		 * as existing ones, which is less unlikely.
  		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Scan backwards from the end to verify that the end pages actually
! 	 * contain no tuples.  This is *necessary*, not optional, because other
! 	 * backends could have added tuples to these pages whilst we were
! 	 * vacuuming.
! 	 */
! 	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
  
! 	if (new_rel_pages >= old_rel_pages)
! 	{
! 		/* can't do anything after all */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Okay to truncate.
! 	 */
! 	RelationTruncate(onerel, new_rel_pages);
  
! 	/*
! 	 * We can release the exclusive lock as soon as we have truncated.	Other
! 	 * backends can't safely access the relation until they have processed the
! 	 * smgr invalidation that smgrtruncate sent out ... but that should happen
! 	 * as part of standard invalidation processing once they acquire lock on
! 	 * the relation.
! 	 */
! 	UnlockRelation(onerel, AccessExclusiveLock);
  
! 	/*
! 	 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 	 * also touching reltuples, since the tuple count wasn't changed by the
! 	 * truncation.
! 	 */
! 	vacrelstats->rel_pages = new_rel_pages;
! 	vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
  
! 	ereport(elevel,
! 			(errmsg("\"%s\": truncated %u to %u pages",
! 					RelationGetRelationName(onerel),
! 					old_rel_pages, new_rel_pages),
! 			 errdetail("%s.",
! 					   pg_rusage_show(&ru0))));
  }
  
  /*
--- 1268,1393 ----
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
+ 	int			lock_retry;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * Loop until no more truncating can be done.
  	 */
! 	do
  	{
  		/*
! 		 * We need full exclusive lock on the relation in order to do 
! 		 * truncation.
! 		 * If we can't get it, give up rather than waiting --- we don't want to
! 		 * block other backends, and we don't want to deadlock (which is quite
! 		 * possible considering we already hold a lower-grade lock).
  		 */
! 		vacrelstats->lock_waiter_detected = false;
! 		lock_retry = 0;
! 		while (true)
! 		{
! 			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
! 				break;
  
! 			/*
! 			 * Check for interrupts while trying to (re-)acquire
! 			 * the exclusive lock.
! 			 */
! 			CHECK_FOR_INTERRUPTS();
  
! 			if (++lock_retry > (AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT /
! 								AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL))
! 			{
! 				/*
! 				 * We failed to establish the lock in the specified
! 				 * number of retries. This means we give up truncating.
! 				 * Suppress the ANALYZE step. Doing an ANALYZE at
! 				 * this point will reset the dead_tuple_count in the
! 				 * stats collector, so we will not get called by the
! 				 * autovacuum launcher again to do the truncate.
! 				 */
! 				vacrelstats->lock_waiter_detected = true;
! 				ereport(LOG,
! 						(errmsg("automatic vacuum of table \"%s.%s.%s\": "
! 								"cannot (re)acquire exclusive "
! 								"lock for truncate scan",
! 								get_database_name(MyDatabaseId),
! 								get_namespace_name(RelationGetNamespace(onerel)),
! 								RelationGetRelationName(onerel))));
! 				return;
! 			}
  
! 			pg_usleep(AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL);
! 		}
  
! 		/*
! 		 * Now that we have exclusive lock, look to see if the rel has grown
! 		 * whilst we were vacuuming with non-exclusive lock.  If so, give up;
! 		 * the newly added pages presumably contain non-deletable tuples.
! 		 */
! 		new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 		if (new_rel_pages != old_rel_pages)
! 		{
! 			/*
! 			 * Note: we intentionally don't update vacrelstats->rel_pages 
! 			 * with the new rel size here.  If we did, it would amount to 
! 			 * assuming that the new pages are empty, which is unlikely.
! 			 * Leaving the numbers alone amounts to assuming that the new 
! 			 * pages have the same tuple density as existing ones, which 
! 			 * is less unlikely.
! 			 */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
  
! 		/*
! 		 * Scan backwards from the end to verify that the end pages actually
! 		 * contain no tuples.  This is *necessary*, not optional, because other
! 		 * backends could have added tuples to these pages whilst we were
! 		 * vacuuming.
! 		 */
! 		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
  
! 		if (new_rel_pages >= old_rel_pages)
! 		{
! 			/* can't do anything after all */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
! 
! 		/*
! 		 * Okay to truncate.
! 		 */
! 		RelationTruncate(onerel, new_rel_pages);
! 
! 		/*
! 		 * We can release the exclusive lock as soon as we have truncated.
! 		 * Other backends can't safely access the relation until they have 
! 		 * processed the smgr invalidation that smgrtruncate sent out ... 
! 		 * but that should happen as part of standard invalidation 
! 		 * processing once they acquire lock on the relation.
! 		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 
! 		/*
! 		 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 		 * also touching reltuples, since the tuple count wasn't changed by the
! 		 * truncation.
! 		 */
! 		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
! 		vacrelstats->rel_pages = new_rel_pages;
! 
! 		ereport(elevel,
! 				(errmsg("\"%s\": truncated %u to %u pages",
! 						RelationGetRelationName(onerel),
! 						old_rel_pages, new_rel_pages),
! 				 errdetail("%s.",
! 						   pg_rusage_show(&ru0))));
! 		old_rel_pages = new_rel_pages;
! 	} while (new_rel_pages > vacrelstats->nonempty_pages && 
! 			vacrelstats->lock_waiter_detected);
  }
  
  /*
*************** static BlockNumber
*** 1340,1345 ****
--- 1399,1410 ----
  count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  {
  	BlockNumber blkno;
+ 	instr_time	starttime;
+ 	instr_time	currenttime;
+ 	instr_time	elapsed;
+ 
+ 	/* Initialize the starttime if we check for conflicting lock requests */
+ 	INSTR_TIME_SET_CURRENT(starttime);
  
  	/* Strange coding of loop control is needed because blkno is unsigned */
  	blkno = vacrelstats->rel_pages;
*************** count_nondeletable_pages(Relation onerel
*** 1352,1357 ****
--- 1417,1453 ----
  		bool		hastup;
  
  		/*
+ 		 * Check if another process requests a lock on our relation.
+ 		 * We are holding an AccessExclusiveLock here, so they will
+ 		 * be waiting. We only do this in autovacuum_truncate_lock_check
+ 		 * millisecond intervals, and we only check if that interval
+ 		 * has elapsed once every 32 blocks to keep the number of
+ 		 * system calls and actual shared lock table lookups to a
+ 		 * minimum.
+ 		 */
+ 		if ((blkno % 32) == 0)
+ 		{
+ 			INSTR_TIME_SET_CURRENT(currenttime);
+ 			elapsed = currenttime;
+ 			INSTR_TIME_SUBTRACT(elapsed, starttime);
+ 			if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000) 
+ 					>= AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ 			{
+ 				if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
+ 				{
+ 					ereport(elevel,
+ 							(errmsg("\"%s\": suspending truncate "
+ 									"due to conflicting lock request",
+ 									RelationGetRelationName(onerel))));
+ 
+ 					vacrelstats->lock_waiter_detected = true;
+ 					return blkno;
+ 				}
+ 				starttime = currenttime;
+ 			}
+ 		}
+ 
+ 		/*
  		 * We don't insert a vacuum delay point here, because we have an
  		 * exclusive lock on the table which we want to hold for as short a
  		 * time as possible.  We still need to check for interrupts however.
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index a7786d0..e1fa74f 100644
*** a/src/backend/storage/lmgr/lmgr.c
--- b/src/backend/storage/lmgr/lmgr.c
*************** UnlockRelation(Relation relation, LOCKMO
*** 233,238 ****
--- 233,256 ----
  }
  
  /*
+  *		LockHasWaitersRelation
+  *
+  * This is a function to check if someone else is waiting on a
+  * lock we are currently holding.
+  */
+ bool
+ LockHasWaitersRelation(Relation relation, LOCKMODE lockmode)
+ {
+ 	LOCKTAG		tag;
+ 
+ 	SET_LOCKTAG_RELATION(tag,
+ 						 relation->rd_lockInfo.lockRelId.dbId,
+ 						 relation->rd_lockInfo.lockRelId.relId);
+ 
+ 	return LockHasWaiters(&tag, lockmode, false);
+ }
+ 
+ /*
   *		LockRelationIdForSession
   *
   * This routine grabs a session-level lock on the target relation.	The
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0183443..ec4da20 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** ProcLockHashCode(const PROCLOCKTAG *proc
*** 538,543 ****
--- 538,635 ----
  	return lockhash;
  }
  
+ /*
+  * LockHasWaiters -- look up 'locktag' and check if releasing this
+  *		lock would wake up other processes waiting for it.
+  */
+ bool
+ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
+ {
+ 	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+ 	LockMethod	lockMethodTable;
+ 	LOCALLOCKTAG localtag;
+ 	LOCALLOCK  *locallock;
+ 	LOCK	   *lock;
+ 	PROCLOCK   *proclock;
+ 	LWLockId	partitionLock;
+ 	bool		hasWaiters = false;
+ 
+ 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+ 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+ 	lockMethodTable = LockMethods[lockmethodid];
+ 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
+ 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
+ 
+ #ifdef LOCK_DEBUG
+ 	if (LOCK_DEBUG_ENABLED(locktag))
+ 		elog(LOG, "LockHasWaiters: lock [%u,%u] %s",
+ 			 locktag->locktag_field1, locktag->locktag_field2,
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ #endif
+ 
+ 	/*
+ 	 * Find the LOCALLOCK entry for this lock and lockmode
+ 	 */
+ 	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+ 	localtag.lock = *locktag;
+ 	localtag.mode = lockmode;
+ 
+ 	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+ 										  (void *) &localtag,
+ 										  HASH_FIND, NULL);
+ 
+ 	/*
+ 	 * let the caller print its own error message, too. Do not ereport(ERROR).
+ 	 */
+ 	if (!locallock || locallock->nLocks <= 0)
+ 	{
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * Check the shared lock table.
+ 	 */
+ 	partitionLock = LockHashPartitionLock(locallock->hashcode);
+ 
+ 	LWLockAcquire(partitionLock, LW_SHARED);
+ 
+ 	/*
+ 	 * We don't need to re-find the lock or proclock, since we kept their
+ 	 * addresses in the locallock table, and they couldn't have been removed
+ 	 * while we were holding a lock on them.
+ 	 */
+ 	lock = locallock->lock;
+ 	LOCK_PRINT("LockHasWaiters: found", lock, lockmode);
+ 	proclock = locallock->proclock;
+ 	PROCLOCK_PRINT("LockHasWaiters: found", proclock);
+ 
+ 	/*
+ 	 * Double-check that we are actually holding a lock of the type we want to
+ 	 * release.
+ 	 */
+ 	if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
+ 	{
+ 		PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
+ 		LWLockRelease(partitionLock);
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		RemoveLocalLock(locallock);
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * Do the checking.
+ 	 */
+ 	if ((lockMethodTable->conflictTab[lockmode] & lock->waitMask) != 0)
+ 		hasWaiters = true;
+ 
+ 	LWLockRelease(partitionLock);
+ 
+ 	return hasWaiters;
+ }
+ 
  
  /*
   * LockAcquire -- Check for lock conflicts, sleep if conflict found,
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index a851758..7c14d92 100644
*** a/src/include/postmaster/autovacuum.h
--- b/src/include/postmaster/autovacuum.h
***************
*** 15,20 ****
--- 15,27 ----
  #define AUTOVACUUM_H
  
  
+ /* Truncate exclusive lock configuration */
+ #define AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL		20 /* ms */
+ #define AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL		50 /* ms */
+ #define AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT			5000 /* ms */
+ 
+ 
+ 
  /* GUC variables */
  extern bool autovacuum_start_daemon;
  extern int	autovacuum_max_workers;
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index de340c4..aa79eda 100644
*** a/src/include/storage/lmgr.h
--- b/src/include/storage/lmgr.h
*************** extern void UnlockRelationOid(Oid relid,
*** 31,36 ****
--- 31,37 ----
  extern void LockRelation(Relation relation, LOCKMODE lockmode);
  extern bool ConditionalLockRelation(Relation relation, LOCKMODE lockmode);
  extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
  
  extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
  extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index e01a5c5..680a132 100644
*** a/src/include/storage/lock.h
--- b/src/include/storage/lock.h
*************** extern void LockReleaseAll(LOCKMETHODID 
*** 494,499 ****
--- 494,501 ----
  extern void LockReleaseSession(LOCKMETHODID lockmethodid);
  extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
  extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+ extern bool LockHasWaiters(const LOCKTAG *locktag,
+ 			LOCKMODE lockmode, bool sessionLock);
  extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
  				 LOCKMODE lockmode);
  extern void AtPrepare_Locks(void);
#40Kevin Grittner
kgrittn@mail.com
In reply to: Jan Wieck (#39)
Re: autovacuum truncate exclusive lock round two

Jan Wieck wrote:

Based on the discussion and what I feel is a consensus I have
created an updated patch that has no GUC at all. The hard coded
parameters in include/postmaster/autovacuum.h are

 AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
 AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
 AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

Since these really aren't part of the external API and are only
referenced in vacuumlazy.c, it seems more appropriate to define
them there.

I gave that the worst workload I can think of. A pgbench-style
application throws about 10 transactions per second at it, so
that the lock constantly has to be given up due to conflicting
lock requests and then reacquired. A "cleanup" process
periodically moves old tuples from the history table to an
archive table, making history a rolling window table. And a
third job, 2-3 times per minute, produces a transaction lasting
10 seconds, forcing autovacuum to give up on reacquiring the
lock.

Even with that workload, autovacuum is slowly but steadily
chopping away at the table.

Applies with minor offsets, builds without warning, and passes
`make check-world`. My tests based on your earlier posted test
script confirm the benefit.

There are some minor white-space issues; for example git diff
--color shows some trailing spaces in comments.

There are no docs, but since there are no user-visible changes in
behavior other than better performance and more prompt and reliable
truncation of tables where we were already doing so, it doesn't seem
like any new docs are needed. Due to the nature of the problem, tests
are tricky to run correctly and take a long time to run, so I don't see
how any regression tests would be appropriate, either.

This patch seems ready for committer, and I would be comfortable
with making the minor changes I mention above and committing it.
I'll wait a day or two to allow any other comments or objections.

To summarize, there has been pathological behavior in an
infrequently-encountered corner case of autovacuum, wasting a lot
of resources indefinitely when it is encountered; this patch gives
a major performance improvement in this situation without any
other user-visible change and without requiring any new GUCs. It
adds a new public function in the locking area to allow a process
to check whether a particular lock it is holding is blocking any
other process, and another to wrap that to make it easy to check
whether the lock held on a particular table is blocking another
process. It uses this new capability to be smarter about scheduling
autovacuum's truncation work, and to avoid throwing away
incremental progress in doing so.
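
For anyone who skipped the diffs, the two new entry points from the
patch are the following; the prototypes are from the patch itself, only
the annotations are mine:

/*
 * lock.c: given a LOCKTAG and a lock mode we already hold, report
 * whether any waiting lock mode conflicts with it, i.e. whether
 * conflictTab[lockmode] intersects lock->waitMask.
 */
extern bool LockHasWaiters(const LOCKTAG *locktag,
            LOCKMODE lockmode, bool sessionLock);

/*
 * lmgr.c: convenience wrapper that builds the relation's LOCKTAG and
 * asks the same question.
 */
extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);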

As such, I don't think it would be crazy to back-patch this, but I
think it would be wise to allow it to be proven on master/9.3 for a
while before considering that.

-Kevin


#41Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#40)
1 attachment(s)
Re: autovacuum truncate exclusive lock round two

On 12/9/2012 2:37 PM, Kevin Grittner wrote:

Jan Wieck wrote:

Based on the discussion and what I feel is a consensus I have
created an updated patch that has no GUC at all. The hard coded
parameters in include/postmaster/autovacuum.h are

AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

Since these really aren't part of the external API and are only
referenced in vacuumlazy.c, it seems more appropriate to define
them there.

I gave that the worst workload I can think of. A pgbench-style
application throws about 10 transactions per second at it, so
that the lock constantly has to be given up due to conflicting
lock requests and then reacquired. A "cleanup" process
periodically moves old tuples from the history table to an
archive table, making history a rolling window table. And a
third job, 2-3 times per minute, produces a transaction lasting
10 seconds, forcing autovacuum to give up on reacquiring the
lock.

Even with that workload, autovacuum is slowly but steadily
chopping away at the table.

Applies with minor offsets, builds without warning, and passes
`make check-world`. My tests based on your earlier posted test
script confirm the benefit.

There are some minor white-space issues; for example git diff
--color shows some trailing spaces in comments.

Cleaned up all of those.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachments:

autovacuum-truncate-lock-4.difftext/x-patch; name=autovacuum-truncate-lock-4.diffDownload
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5036849..1686b88 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 52,62 ****
--- 52,64 ----
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/lmgr.h"
+ #include "storage/proc.h"
  #include "utils/lsyscache.h"
  #include "utils/memutils.h"
  #include "utils/pg_rusage.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "portability/instr_time.h"
  
  
  /*
***************
*** 82,87 ****
--- 84,96 ----
   */
  #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
  
+ /*
+  * Truncate exclusive lock configuration
+  */
+ #define AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL     20 /* ms */
+ #define AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL      50 /* ms */
+ #define AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT            5000 /* ms */
+ 
  typedef struct LVRelStats
  {
  	/* hasindex = true means two-pass strategy; false means one-pass */
*************** typedef struct LVRelStats
*** 103,108 ****
--- 112,118 ----
  	ItemPointer dead_tuples;	/* array of ItemPointerData */
  	int			num_index_scans;
  	TransactionId latestRemovedXid;
+ 	bool		lock_waiter_detected;
  } LVRelStats;
  
  
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 193,198 ****
--- 203,210 ----
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
  	vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples;
  	vacrelstats->num_index_scans = 0;
+ 	vacrelstats->pages_removed = 0;
+ 	vacrelstats->lock_waiter_detected = false;
  
  	/* Open all indexes of the relation */
  	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 259,268 ****
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/* report results to the stats collector, too */
! 	pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
--- 271,288 ----
  						vacrelstats->hasindex,
  						new_frozen_xid);
  
! 	/*
! 	 * report results to the stats collector, too.
! 	 * An early terminated lazy_truncate_heap attempt
! 	 * suppresses the message and also cancels the
! 	 * execution of ANALYZE, if that was ordered.
! 	 */
! 	if (!vacrelstats->lock_waiter_detected)
! 		pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
+ 	else
+ 		vacstmt->options &= ~VACOPT_ANALYZE;
  
  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
*************** lazy_truncate_heap(Relation onerel, LVRe
*** 1257,1336 ****
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * We need full exclusive lock on the relation in order to do truncation.
! 	 * If we can't get it, give up rather than waiting --- we don't want to
! 	 * block other backends, and we don't want to deadlock (which is quite
! 	 * possible considering we already hold a lower-grade lock).
! 	 */
! 	if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
! 		return;
! 
! 	/*
! 	 * Now that we have exclusive lock, look to see if the rel has grown
! 	 * whilst we were vacuuming with non-exclusive lock.  If so, give up; the
! 	 * newly added pages presumably contain non-deletable tuples.
  	 */
! 	new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 	if (new_rel_pages != old_rel_pages)
  	{
  		/*
! 		 * Note: we intentionally don't update vacrelstats->rel_pages with the
! 		 * new rel size here.  If we did, it would amount to assuming that the
! 		 * new pages are empty, which is unlikely.	Leaving the numbers alone
! 		 * amounts to assuming that the new pages have the same tuple density
! 		 * as existing ones, which is less unlikely.
  		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Scan backwards from the end to verify that the end pages actually
! 	 * contain no tuples.  This is *necessary*, not optional, because other
! 	 * backends could have added tuples to these pages whilst we were
! 	 * vacuuming.
! 	 */
! 	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
  
! 	if (new_rel_pages >= old_rel_pages)
! 	{
! 		/* can't do anything after all */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
  
! 	/*
! 	 * Okay to truncate.
! 	 */
! 	RelationTruncate(onerel, new_rel_pages);
  
! 	/*
! 	 * We can release the exclusive lock as soon as we have truncated.	Other
! 	 * backends can't safely access the relation until they have processed the
! 	 * smgr invalidation that smgrtruncate sent out ... but that should happen
! 	 * as part of standard invalidation processing once they acquire lock on
! 	 * the relation.
! 	 */
! 	UnlockRelation(onerel, AccessExclusiveLock);
  
! 	/*
! 	 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 	 * also touching reltuples, since the tuple count wasn't changed by the
! 	 * truncation.
! 	 */
! 	vacrelstats->rel_pages = new_rel_pages;
! 	vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
  
! 	ereport(elevel,
! 			(errmsg("\"%s\": truncated %u to %u pages",
! 					RelationGetRelationName(onerel),
! 					old_rel_pages, new_rel_pages),
! 			 errdetail("%s.",
! 					   pg_rusage_show(&ru0))));
  }
  
  /*
--- 1277,1402 ----
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;
+ 	int			lock_retry;
  
  	pg_rusage_init(&ru0);
  
  	/*
! 	 * Loop until no more truncating can be done.
  	 */
! 	do
  	{
  		/*
! 		 * We need full exclusive lock on the relation in order to do
! 		 * truncation.
! 		 * If we can't get it, give up rather than waiting --- we don't want to
! 		 * block other backends, and we don't want to deadlock (which is quite
! 		 * possible considering we already hold a lower-grade lock).
  		 */
! 		vacrelstats->lock_waiter_detected = false;
! 		lock_retry = 0;
! 		while (true)
! 		{
! 			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
! 				break;
  
! 			/*
! 			 * Check for interrupts while trying to (re-)acquire
! 			 * the exclusive lock.
! 			 */
! 			CHECK_FOR_INTERRUPTS();
  
! 			if (++lock_retry > (AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT /
! 								AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL))
! 			{
! 				/*
! 				 * We failed to establish the lock in the specified
! 				 * number of retries. This means we give up truncating.
! 				 * Suppress the ANALYZE step. Doing an ANALYZE at
! 				 * this point will reset the dead_tuple_count in the
! 				 * stats collector, so we will not get called by the
! 				 * autovacuum launcher again to do the truncate.
! 				 */
! 				vacrelstats->lock_waiter_detected = true;
! 				ereport(LOG,
! 						(errmsg("automatic vacuum of table \"%s.%s.%s\": "
! 								"cannot (re)acquire exclusive "
! 								"lock for truncate scan",
! 								get_database_name(MyDatabaseId),
! 								get_namespace_name(RelationGetNamespace(onerel)),
! 								RelationGetRelationName(onerel))));
! 				return;
! 			}
  
! 			pg_usleep(AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL);
! 		}
  
! 		/*
! 		 * Now that we have exclusive lock, look to see if the rel has grown
! 		 * whilst we were vacuuming with non-exclusive lock.  If so, give up;
! 		 * the newly added pages presumably contain non-deletable tuples.
! 		 */
! 		new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 		if (new_rel_pages != old_rel_pages)
! 		{
! 			/*
! 			 * Note: we intentionally don't update vacrelstats->rel_pages
! 			 * with the new rel size here.  If we did, it would amount to
! 			 * assuming that the new pages are empty, which is unlikely.
! 			 * Leaving the numbers alone amounts to assuming that the new
! 			 * pages have the same tuple density as existing ones, which
! 			 * is less unlikely.
! 			 */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
  
! 		/*
! 		 * Scan backwards from the end to verify that the end pages actually
! 		 * contain no tuples.  This is *necessary*, not optional, because other
! 		 * backends could have added tuples to these pages whilst we were
! 		 * vacuuming.
! 		 */
! 		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
  
! 		if (new_rel_pages >= old_rel_pages)
! 		{
! 			/* can't do anything after all */
! 			UnlockRelation(onerel, AccessExclusiveLock);
! 			return;
! 		}
! 
! 		/*
! 		 * Okay to truncate.
! 		 */
! 		RelationTruncate(onerel, new_rel_pages);
! 
! 		/*
! 		 * We can release the exclusive lock as soon as we have truncated.
! 		 * Other backends can't safely access the relation until they have
! 		 * processed the smgr invalidation that smgrtruncate sent out ...
! 		 * but that should happen as part of standard invalidation
! 		 * processing once they acquire lock on the relation.
! 		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 
! 		/*
! 		 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 		 * also touching reltuples, since the tuple count wasn't changed by the
! 		 * truncation.
! 		 */
! 		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
! 		vacrelstats->rel_pages = new_rel_pages;
! 
! 		ereport(elevel,
! 				(errmsg("\"%s\": truncated %u to %u pages",
! 						RelationGetRelationName(onerel),
! 						old_rel_pages, new_rel_pages),
! 				 errdetail("%s.",
! 						   pg_rusage_show(&ru0))));
! 		old_rel_pages = new_rel_pages;
! 	} while (new_rel_pages > vacrelstats->nonempty_pages &&
! 			vacrelstats->lock_waiter_detected);
  }
  
  /*
*************** static BlockNumber
*** 1342,1347 ****
--- 1408,1419 ----
  count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  {
  	BlockNumber blkno;
+ 	instr_time	starttime;
+ 	instr_time	currenttime;
+ 	instr_time	elapsed;
+ 
+ 	/* Initialize the starttime if we check for conflicting lock requests */
+ 	INSTR_TIME_SET_CURRENT(starttime);
  
  	/* Strange coding of loop control is needed because blkno is unsigned */
  	blkno = vacrelstats->rel_pages;
*************** count_nondeletable_pages(Relation onerel
*** 1354,1359 ****
--- 1426,1462 ----
  		bool		hastup;
  
  		/*
+ 		 * Check if another process requests a lock on our relation.
+ 		 * We are holding an AccessExclusiveLock here, so they will
+ 		 * be waiting. We only do this in autovacuum_truncate_lock_check
+ 		 * millisecond intervals, and we only check if that interval
+ 		 * has elapsed once every 32 blocks to keep the number of
+ 		 * system calls and actual shared lock table lookups to a
+ 		 * minimum.
+ 		 */
+ 		if ((blkno % 32) == 0)
+ 		{
+ 			INSTR_TIME_SET_CURRENT(currenttime);
+ 			elapsed = currenttime;
+ 			INSTR_TIME_SUBTRACT(elapsed, starttime);
+ 			if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
+ 					>= AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ 			{
+ 				if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
+ 				{
+ 					ereport(elevel,
+ 							(errmsg("\"%s\": suspending truncate "
+ 									"due to conflicting lock request",
+ 									RelationGetRelationName(onerel))));
+ 
+ 					vacrelstats->lock_waiter_detected = true;
+ 					return blkno;
+ 				}
+ 				starttime = currenttime;
+ 			}
+ 		}
+ 
+ 		/*
  		 * We don't insert a vacuum delay point here, because we have an
  		 * exclusive lock on the table which we want to hold for as short a
  		 * time as possible.  We still need to check for interrupts however.
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index a7786d0..e1fa74f 100644
*** a/src/backend/storage/lmgr/lmgr.c
--- b/src/backend/storage/lmgr/lmgr.c
*************** UnlockRelation(Relation relation, LOCKMO
*** 233,238 ****
--- 233,256 ----
  }
  
  /*
+  *		LockHasWaitersRelation
+  *
+  * This is a function to check if someone else is waiting on a
+  * lock we are currently holding.
+  */
+ bool
+ LockHasWaitersRelation(Relation relation, LOCKMODE lockmode)
+ {
+ 	LOCKTAG		tag;
+ 
+ 	SET_LOCKTAG_RELATION(tag,
+ 						 relation->rd_lockInfo.lockRelId.dbId,
+ 						 relation->rd_lockInfo.lockRelId.relId);
+ 
+ 	return LockHasWaiters(&tag, lockmode, false);
+ }
+ 
+ /*
   *		LockRelationIdForSession
   *
   * This routine grabs a session-level lock on the target relation.	The
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0183443..ec4da20 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** ProcLockHashCode(const PROCLOCKTAG *proc
*** 538,543 ****
--- 538,635 ----
  	return lockhash;
  }
  
+ /*
+  * LockHasWaiters -- look up 'locktag' and check if releasing this
+  *		lock would wake up other processes waiting for it.
+  */
+ bool
+ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
+ {
+ 	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+ 	LockMethod	lockMethodTable;
+ 	LOCALLOCKTAG localtag;
+ 	LOCALLOCK  *locallock;
+ 	LOCK	   *lock;
+ 	PROCLOCK   *proclock;
+ 	LWLockId	partitionLock;
+ 	bool		hasWaiters = false;
+ 
+ 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+ 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+ 	lockMethodTable = LockMethods[lockmethodid];
+ 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
+ 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
+ 
+ #ifdef LOCK_DEBUG
+ 	if (LOCK_DEBUG_ENABLED(locktag))
+ 		elog(LOG, "LockHasWaiters: lock [%u,%u] %s",
+ 			 locktag->locktag_field1, locktag->locktag_field2,
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ #endif
+ 
+ 	/*
+ 	 * Find the LOCALLOCK entry for this lock and lockmode
+ 	 */
+ 	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+ 	localtag.lock = *locktag;
+ 	localtag.mode = lockmode;
+ 
+ 	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+ 										  (void *) &localtag,
+ 										  HASH_FIND, NULL);
+ 
+ 	/*
+ 	 * let the caller print its own error message, too. Do not ereport(ERROR).
+ 	 */
+ 	if (!locallock || locallock->nLocks <= 0)
+ 	{
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * Check the shared lock table.
+ 	 */
+ 	partitionLock = LockHashPartitionLock(locallock->hashcode);
+ 
+ 	LWLockAcquire(partitionLock, LW_SHARED);
+ 
+ 	/*
+ 	 * We don't need to re-find the lock or proclock, since we kept their
+ 	 * addresses in the locallock table, and they couldn't have been removed
+ 	 * while we were holding a lock on them.
+ 	 */
+ 	lock = locallock->lock;
+ 	LOCK_PRINT("LockHasWaiters: found", lock, lockmode);
+ 	proclock = locallock->proclock;
+ 	PROCLOCK_PRINT("LockHasWaiters: found", proclock);
+ 
+ 	/*
+ 	 * Double-check that we are actually holding a lock of the type we want to
+ 	 * release.
+ 	 */
+ 	if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
+ 	{
+ 		PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
+ 		LWLockRelease(partitionLock);
+ 		elog(WARNING, "you don't own a lock of type %s",
+ 			 lockMethodTable->lockModeNames[lockmode]);
+ 		RemoveLocalLock(locallock);
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * Do the checking.
+ 	 */
+ 	if ((lockMethodTable->conflictTab[lockmode] & lock->waitMask) != 0)
+ 		hasWaiters = true;
+ 
+ 	LWLockRelease(partitionLock);
+ 
+ 	return hasWaiters;
+ }
+ 
  
  /*
   * LockAcquire -- Check for lock conflicts, sleep if conflict found,
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index de340c4..aa79eda 100644
*** a/src/include/storage/lmgr.h
--- b/src/include/storage/lmgr.h
*************** extern void UnlockRelationOid(Oid relid,
*** 31,36 ****
--- 31,37 ----
  extern void LockRelation(Relation relation, LOCKMODE lockmode);
  extern bool ConditionalLockRelation(Relation relation, LOCKMODE lockmode);
  extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
  
  extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
  extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index e01a5c5..680a132 100644
*** a/src/include/storage/lock.h
--- b/src/include/storage/lock.h
*************** extern void LockReleaseAll(LOCKMETHODID 
*** 494,499 ****
--- 494,501 ----
  extern void LockReleaseSession(LOCKMETHODID lockmethodid);
  extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
  extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+ extern bool LockHasWaiters(const LOCKTAG *locktag,
+ 			LOCKMODE lockmode, bool sessionLock);
  extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
  				 LOCKMODE lockmode);
  extern void AtPrepare_Locks(void);
#42Kevin Grittner
kgrittn@mail.com
In reply to: Jan Wieck (#41)
Re: autovacuum truncate exclusive lock round two

Jan Wieck wrote:

Cleaned up all of those.

Applied with trivial editing, mostly from a pgindent run against
modified files.

-Kevin
